users_and_tweets_data_based_10000_model_1_with_standardization_and_all_tweets_of_user.ipynb
      [2]:
       
      import pandas as pd
      import numpy as np
      import matplotlib.pyplot as plt
      import seaborn as sns
      import plotly.express as px
      import plotly.graph_objects as go
      from plotly.subplots import make_subplots
      from pandas.api.types import is_numeric_dtype
      from datetime import datetime
      [3]:
       
      !pip install tensorflow
      Requirement already satisfied: tensorflow in /home/jupyter/.local/lib/python3.7/site-packages (2.11.0)
      Requirement already satisfied: absl-py>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow) (1.4.0)
      Requirement already satisfied: astunparse>=1.6.0 in /home/jupyter/.local/lib/python3.7/site-packages (from tensorflow) (1.6.3)
      Collecting flatbuffers>=2.0 (from tensorflow)
        Obtaining dependency information for flatbuffers>=2.0 from https://files.pythonhosted.org/packages/6f/12/d5c79ee252793ffe845d58a913197bfa02ae9a0b5c9bc3dc4b58d477b9e7/flatbuffers-23.5.26-py2.py3-none-any.whl.metadata
        Using cached flatbuffers-23.5.26-py2.py3-none-any.whl.metadata (850 bytes)
      Requirement already satisfied: gast<=0.4.0,>=0.2.1 in /home/jupyter/.local/lib/python3.7/site-packages (from tensorflow) (0.4.0)
      Requirement already satisfied: google-pasta>=0.1.1 in /home/jupyter/.local/lib/python3.7/site-packages (from tensorflow) (0.2.0)
      Requirement already satisfied: grpcio<2.0,>=1.24.3 in /opt/conda/lib/python3.7/site-packages (from tensorflow) (1.56.2)
      Requirement already satisfied: h5py>=2.9.0 in /home/jupyter/.local/lib/python3.7/site-packages (from tensorflow) (3.8.0)
      Requirement already satisfied: keras<2.12,>=2.11.0 in /home/jupyter/.local/lib/python3.7/site-packages (from tensorflow) (2.11.0)
      Collecting libclang>=13.0.0 (from tensorflow)
        Obtaining dependency information for libclang>=13.0.0 from https://files.pythonhosted.org/packages/ea/df/55525e489c43f9dbb6c8ea27d8a567b3dcd18a22f3c45483055f5ca6611d/libclang-16.0.6-py2.py3-none-manylinux2010_x86_64.whl.metadata
        Using cached libclang-16.0.6-py2.py3-none-manylinux2010_x86_64.whl.metadata (5.2 kB)
      Requirement already satisfied: numpy>=1.20 in /opt/conda/lib/python3.7/site-packages (from tensorflow) (1.21.6)
      Requirement already satisfied: opt-einsum>=2.3.2 in /home/jupyter/.local/lib/python3.7/site-packages (from tensorflow) (3.3.0)
      Requirement already satisfied: packaging in /opt/conda/lib/python3.7/site-packages (from tensorflow) (23.1)
      Requirement already satisfied: protobuf<3.20,>=3.9.2 in /home/jupyter/.local/lib/python3.7/site-packages (from tensorflow) (3.19.6)
      Requirement already satisfied: setuptools in /opt/conda/lib/python3.7/site-packages (from tensorflow) (68.0.0)
      Requirement already satisfied: six>=1.12.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow) (1.16.0)
      Requirement already satisfied: tensorboard<2.12,>=2.11 in /home/jupyter/.local/lib/python3.7/site-packages (from tensorflow) (2.11.2)
      Collecting tensorflow-estimator<2.12,>=2.11.0 (from tensorflow)
        Using cached tensorflow_estimator-2.11.0-py2.py3-none-any.whl (439 kB)
      Collecting termcolor>=1.1.0 (from tensorflow)
        Using cached termcolor-2.3.0-py3-none-any.whl (6.9 kB)
      Requirement already satisfied: typing-extensions>=3.6.6 in /opt/conda/lib/python3.7/site-packages (from tensorflow) (4.7.1)
      Requirement already satisfied: wrapt>=1.11.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow) (1.15.0)
      Collecting tensorflow-io-gcs-filesystem>=0.23.1 (from tensorflow)
        Obtaining dependency information for tensorflow-io-gcs-filesystem>=0.23.1 from https://files.pythonhosted.org/packages/a4/c0/f9ac791c3f6f58a343b350894a3e92d44e53d20d7cf205988279ebcbc6e5/tensorflow_io_gcs_filesystem-0.33.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata
        Using cached tensorflow_io_gcs_filesystem-0.33.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (14 kB)
      Requirement already satisfied: wheel<1.0,>=0.23.0 in /opt/conda/lib/python3.7/site-packages (from astunparse>=1.6.0->tensorflow) (0.41.1)
      Requirement already satisfied: google-auth<3,>=1.6.3 in /opt/conda/lib/python3.7/site-packages (from tensorboard<2.12,>=2.11->tensorflow) (2.22.0)
      Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /home/jupyter/.local/lib/python3.7/site-packages (from tensorboard<2.12,>=2.11->tensorflow) (0.4.6)
      Requirement already satisfied: markdown>=2.6.8 in /home/jupyter/.local/lib/python3.7/site-packages (from tensorboard<2.12,>=2.11->tensorflow) (3.4.4)
      Requirement already satisfied: requests<3,>=2.21.0 in /opt/conda/lib/python3.7/site-packages (from tensorboard<2.12,>=2.11->tensorflow) (2.31.0)
      Collecting tensorboard-data-server<0.7.0,>=0.6.0 (from tensorboard<2.12,>=2.11->tensorflow)
        Using cached tensorboard_data_server-0.6.1-py3-none-manylinux2010_x86_64.whl (4.9 MB)
      Collecting tensorboard-plugin-wit>=1.6.0 (from tensorboard<2.12,>=2.11->tensorflow)
        Using cached tensorboard_plugin_wit-1.8.1-py3-none-any.whl (781 kB)
      Collecting werkzeug>=1.0.1 (from tensorboard<2.12,>=2.11->tensorflow)
        Using cached Werkzeug-2.2.3-py3-none-any.whl (233 kB)
      Requirement already satisfied: cachetools<6.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.12,>=2.11->tensorflow) (5.3.1)
      Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.7/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.12,>=2.11->tensorflow) (0.3.0)
      Requirement already satisfied: rsa<5,>=3.1.4 in /opt/conda/lib/python3.7/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.12,>=2.11->tensorflow) (4.9)
      Requirement already satisfied: urllib3<2.0 in /opt/conda/lib/python3.7/site-packages (from google-auth<3,>=1.6.3->tensorboard<2.12,>=2.11->tensorflow) (1.26.16)
      Requirement already satisfied: requests-oauthlib>=0.7.0 in /opt/conda/lib/python3.7/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.12,>=2.11->tensorflow) (1.3.1)
      Requirement already satisfied: importlib-metadata>=4.4 in /opt/conda/lib/python3.7/site-packages (from markdown>=2.6.8->tensorboard<2.12,>=2.11->tensorflow) (4.11.4)
      Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.12,>=2.11->tensorflow) (3.2.0)
      Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.12,>=2.11->tensorflow) (3.4)
      Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.12,>=2.11->tensorflow) (2023.7.22)
      Requirement already satisfied: MarkupSafe>=2.1.1 in /opt/conda/lib/python3.7/site-packages (from werkzeug>=1.0.1->tensorboard<2.12,>=2.11->tensorflow) (2.1.1)
      Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata>=4.4->markdown>=2.6.8->tensorboard<2.12,>=2.11->tensorflow) (3.15.0)
      Requirement already satisfied: pyasn1<0.6.0,>=0.4.6 in /opt/conda/lib/python3.7/site-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard<2.12,>=2.11->tensorflow) (0.5.0)
      Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.7/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.12,>=2.11->tensorflow) (3.2.2)
      Using cached flatbuffers-23.5.26-py2.py3-none-any.whl (26 kB)
      Using cached libclang-16.0.6-py2.py3-none-manylinux2010_x86_64.whl (22.9 MB)
      Using cached tensorflow_io_gcs_filesystem-0.33.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (2.4 MB)
      Installing collected packages: tensorboard-plugin-wit, libclang, flatbuffers, werkzeug, termcolor, tensorflow-io-gcs-filesystem, tensorflow-estimator, tensorboard-data-server
      Successfully installed flatbuffers-23.5.26 libclang-16.0.6 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.1 tensorflow-estimator-2.11.0 tensorflow-io-gcs-filesystem-0.33.0 termcolor-2.3.0 werkzeug-2.2.3
      
      [4]:
       
      import tensorflow as tf
      2023-09-03 23:40:38.987825: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
      To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
      2023-09-03 23:40:53.885297: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
      2023-09-03 23:40:53.887013: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
      2023-09-03 23:40:53.887041: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
      
      [5]:
       
      pd.options.mode.chained_assignment = None 
      [6]:
       
      !pip install keras
      Requirement already satisfied: keras in /home/jupyter/.local/lib/python3.7/site-packages (2.11.0)
      
      [7]:
       
      !pip install scikeras
      Collecting scikeras
        Using cached scikeras-0.10.0-py3-none-any.whl (27 kB)
      Requirement already satisfied: importlib-metadata>=3 in /opt/conda/lib/python3.7/site-packages (from scikeras) (4.11.4)
      Requirement already satisfied: packaging>=0.21 in /opt/conda/lib/python3.7/site-packages (from scikeras) (23.1)
      Requirement already satisfied: scikit-learn>=1.0.0 in /opt/conda/lib/python3.7/site-packages (from scikeras) (1.0.2)
      Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata>=3->scikeras) (3.15.0)
      Requirement already satisfied: typing-extensions>=3.6.4 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata>=3->scikeras) (4.7.1)
      Requirement already satisfied: numpy>=1.14.6 in /opt/conda/lib/python3.7/site-packages (from scikit-learn>=1.0.0->scikeras) (1.21.6)
      Requirement already satisfied: scipy>=1.1.0 in /opt/conda/lib/python3.7/site-packages (from scikit-learn>=1.0.0->scikeras) (1.7.3)
      Requirement already satisfied: joblib>=0.11 in /opt/conda/lib/python3.7/site-packages (from scikit-learn>=1.0.0->scikeras) (1.3.1)
      Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from scikit-learn>=1.0.0->scikeras) (3.1.0)
      Installing collected packages: scikeras
      Successfully installed scikeras-0.10.0
      
      [8]:
       
      import keras
      from keras.models import Sequential, Model
      from keras.layers import Input, Dense, Activation, Dropout, Flatten, Embedding, LSTM, Concatenate, Reshape, Bidirectional, SimpleRNN
      from keras.layers.convolutional import Conv1D, Conv2D, MaxPooling1D, MaxPooling2D
      from keras.callbacks import ModelCheckpoint, EarlyStopping
      [9]:
       
      import sklearn
      from sklearn.neighbors import NearestNeighbors
      from sklearn.preprocessing import MinMaxScaler
      from sklearn.preprocessing import StandardScaler
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import accuracy_score
      from sklearn.metrics import precision_score
      from sklearn.metrics import recall_score
      from sklearn.metrics import f1_score
      from sklearn.metrics import roc_auc_score
      [10]:
       
      !pip install livelossplot
      from livelossplot.tf_keras import PlotLossesCallback
      Collecting livelossplot
        Using cached livelossplot-0.5.5-py3-none-any.whl (22 kB)
      Requirement already satisfied: matplotlib in /opt/conda/lib/python3.7/site-packages (from livelossplot) (3.5.3)
      Collecting bokeh (from livelossplot)
        Using cached bokeh-2.4.3-py3-none-any.whl (18.5 MB)
      Requirement already satisfied: ipython==7.* in /opt/conda/lib/python3.7/site-packages (from livelossplot) (7.33.0)
      Requirement already satisfied: numpy<1.22 in /opt/conda/lib/python3.7/site-packages (from livelossplot) (1.21.6)
      Requirement already satisfied: setuptools>=18.5 in /opt/conda/lib/python3.7/site-packages (from ipython==7.*->livelossplot) (68.0.0)
      Requirement already satisfied: jedi>=0.16 in /opt/conda/lib/python3.7/site-packages (from ipython==7.*->livelossplot) (0.19.0)
      Requirement already satisfied: decorator in /opt/conda/lib/python3.7/site-packages (from ipython==7.*->livelossplot) (5.1.1)
      Requirement already satisfied: pickleshare in /opt/conda/lib/python3.7/site-packages (from ipython==7.*->livelossplot) (0.7.5)
      Requirement already satisfied: traitlets>=4.2 in /opt/conda/lib/python3.7/site-packages (from ipython==7.*->livelossplot) (5.9.0)
      Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from ipython==7.*->livelossplot) (3.0.39)
      Requirement already satisfied: pygments in /opt/conda/lib/python3.7/site-packages (from ipython==7.*->livelossplot) (2.16.1)
      Requirement already satisfied: backcall in /opt/conda/lib/python3.7/site-packages (from ipython==7.*->livelossplot) (0.2.0)
      Requirement already satisfied: matplotlib-inline in /opt/conda/lib/python3.7/site-packages (from ipython==7.*->livelossplot) (0.1.6)
      Requirement already satisfied: pexpect>4.3 in /opt/conda/lib/python3.7/site-packages (from ipython==7.*->livelossplot) (4.8.0)
      Requirement already satisfied: Jinja2>=2.9 in /opt/conda/lib/python3.7/site-packages (from bokeh->livelossplot) (3.1.2)
      Requirement already satisfied: packaging>=16.8 in /opt/conda/lib/python3.7/site-packages (from bokeh->livelossplot) (23.1)
      Requirement already satisfied: pillow>=7.1.0 in /opt/conda/lib/python3.7/site-packages (from bokeh->livelossplot) (9.5.0)
      Requirement already satisfied: PyYAML>=3.10 in /opt/conda/lib/python3.7/site-packages (from bokeh->livelossplot) (6.0.1)
      Requirement already satisfied: tornado>=5.1 in /opt/conda/lib/python3.7/site-packages (from bokeh->livelossplot) (6.2)
      Requirement already satisfied: typing-extensions>=3.10.0 in /opt/conda/lib/python3.7/site-packages (from bokeh->livelossplot) (4.7.1)
      Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.7/site-packages (from matplotlib->livelossplot) (0.11.0)
      Requirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.7/site-packages (from matplotlib->livelossplot) (4.38.0)
      Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.7/site-packages (from matplotlib->livelossplot) (1.4.4)
      Requirement already satisfied: pyparsing>=2.2.1 in /opt/conda/lib/python3.7/site-packages (from matplotlib->livelossplot) (3.1.1)
      Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.7/site-packages (from matplotlib->livelossplot) (2.8.2)
      Requirement already satisfied: parso<0.9.0,>=0.8.3 in /opt/conda/lib/python3.7/site-packages (from jedi>=0.16->ipython==7.*->livelossplot) (0.8.3)
      Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/lib/python3.7/site-packages (from Jinja2>=2.9->bokeh->livelossplot) (2.1.1)
      Requirement already satisfied: ptyprocess>=0.5 in /opt/conda/lib/python3.7/site-packages (from pexpect>4.3->ipython==7.*->livelossplot) (0.7.0)
      Requirement already satisfied: wcwidth in /opt/conda/lib/python3.7/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython==7.*->livelossplot) (0.2.6)
      Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.7/site-packages (from python-dateutil>=2.7->matplotlib->livelossplot) (1.16.0)
      Installing collected packages: bokeh, livelossplot
      Successfully installed bokeh-2.4.3 livelossplot-0.5.5
      
      [11]:
       
      !pip install shap
      import shap
      Collecting shap
        Obtaining dependency information for shap from https://files.pythonhosted.org/packages/b8/d8/15066ae71ba63683b8e53a8bef0e75bd87e95b79ef293f63fa674b351d9b/shap-0.42.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
        Using cached shap-0.42.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (23 kB)
      Requirement already satisfied: numpy in /opt/conda/lib/python3.7/site-packages (from shap) (1.21.6)
      Requirement already satisfied: scipy in /opt/conda/lib/python3.7/site-packages (from shap) (1.7.3)
      Requirement already satisfied: scikit-learn in /opt/conda/lib/python3.7/site-packages (from shap) (1.0.2)
      Requirement already satisfied: pandas in /opt/conda/lib/python3.7/site-packages (from shap) (1.3.5)
      Requirement already satisfied: tqdm>=4.27.0 in /opt/conda/lib/python3.7/site-packages (from shap) (4.63.0)
      Requirement already satisfied: packaging>20.9 in /opt/conda/lib/python3.7/site-packages (from shap) (23.1)
      Collecting slicer==0.0.7 (from shap)
        Using cached slicer-0.0.7-py3-none-any.whl (14 kB)
      Requirement already satisfied: numba in /opt/conda/lib/python3.7/site-packages (from shap) (0.56.4)
      Requirement already satisfied: cloudpickle in /opt/conda/lib/python3.7/site-packages (from shap) (2.2.1)
      Requirement already satisfied: llvmlite<0.40,>=0.39.0dev0 in /opt/conda/lib/python3.7/site-packages (from numba->shap) (0.39.1)
      Requirement already satisfied: setuptools in /opt/conda/lib/python3.7/site-packages (from numba->shap) (68.0.0)
      Requirement already satisfied: importlib-metadata in /opt/conda/lib/python3.7/site-packages (from numba->shap) (4.11.4)
      Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.7/site-packages (from pandas->shap) (2.8.2)
      Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.7/site-packages (from pandas->shap) (2023.3)
      Requirement already satisfied: joblib>=0.11 in /opt/conda/lib/python3.7/site-packages (from scikit-learn->shap) (1.3.1)
      Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from scikit-learn->shap) (3.1.0)
      Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas->shap) (1.16.0)
      Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->numba->shap) (3.15.0)
      Requirement already satisfied: typing-extensions>=3.6.4 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->numba->shap) (4.7.1)
      Using cached shap-0.42.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (545 kB)
      Installing collected packages: slicer, shap
      Successfully installed shap-0.42.1 slicer-0.0.7
      
      Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)
      
      [12]:
       
      from google.cloud import bigquery
      Authentication for working from Google Colab¶

      [13]:
       
      # from google.colab import auth
      # auth.authenticate_user()

      Destination to save trained models¶


      Google disk¶

      [14]:
       
      # from google.colab import drive
      ​
      # drive.mount('/content/gdrive')
      # project_folder_path = '/content/gdrive/Shareddrives/Magisterka/PROJEKT/'
      # models_path = project_folder_path + '/models'

      Vertex AI Jupyter Lab¶

      [15]:
       
      data_analysis_folder_path = '../'
      models_path = data_analysis_folder_path + '/models'

      Connect to Bigquery service¶

      [16]:
       
      import sys
      sys.path.append("./../../")
      from gcp_env import PROJECT_ID, LOCATION
      [17]:
       
      project_id = PROJECT_ID # Fill project id
      bqclient = bigquery.Client(location=LOCATION, project=project_id)

      Users data¶


      Loading data¶

      [18]:
       
      dataset_name = "twitbot_22_preprocessed_common_users_ids"
      ​
      users_table_name = "users"
      BQ_TABLE_USERS = dataset_name + "." + users_table_name
      users_table_id = project_id + "." + BQ_TABLE_USERS
      [19]:
       
      # job_config = bigquery.QueryJobConfig(
      #     allow_large_results=True, destination=users_table_id, use_legacy_sql=True
      # )
      [20]:
       
      SQL_QUERY = f"""WITH 
        human_records AS (
          SELECT *, ROW_NUMBER() OVER () row_num 
          FROM {BQ_TABLE_USERS}
          WHERE label = 'human' 
          LIMIT 5000),
        bot_records AS (
        SELECT *, ROW_NUMBER() OVER () row_num 
          FROM {BQ_TABLE_USERS}
          WHERE label = 'bot' 
          LIMIT 5000)
        SELECT * FROM human_records 
          UNION ALL SELECT * 
          FROM bot_records 
          ORDER BY row_num;"""
      ​
      users_df1 = bqclient.query(SQL_QUERY).to_dataframe()
      users_df1 = users_df1.drop(['row_num'], axis=1)
      [21]:
       
      # LIMIT RESULTS OPTIONS
      pd.set_option('display.max_rows', 100)
      # pd.set_option('display.max_rows', None)
      pd.set_option('display.max_columns', None)
      pd.set_option('display.max_colwidth', None)
      [22]:
       
      num_bots = len(users_df1.loc[users_df1['label']=='bot'])       # bots number
      num_humans = len(users_df1.loc[users_df1['label']=='human'])   # humans number
      ​
      print("Number of real users: ", num_humans)
      print("Number of bots: ", num_bots)
      Number of real users:  5000
      Number of bots:  5000
      
      [23]:
       
      org_users_df = pd.DataFrame(users_df1).copy()
      users_df2 = pd.DataFrame(org_users_df).copy()
      [24]:
       
      def filter_df_for_balanced_classes(df, bot_label_value='bot', human_label_value='human'):
          new_df = pd.DataFrame()

          i = 0  # bots counter
          j = 0  # humans counter
          k = 0
          num_bots = len(df.loc[df['label'] == bot_label_value])
          num_humans = len(df.loc[df['label'] == human_label_value])
          max_num = min(num_bots, num_humans)
          for index, record in df.iterrows():
              if k < (2 * max_num):
                  if record['label'] == bot_label_value and i < max_num:
                      # DataFrame.append is deprecated since pandas 1.4; concat the row instead
                      new_df = pd.concat([new_df, record.to_frame().T])
                      i += 1
                      k += 1
                  if record['label'] == human_label_value and j < max_num:
                      new_df = pd.concat([new_df, record.to_frame().T])
                      j += 1
                      k += 1

          print("Number of bots: ", len(new_df.loc[new_df['label'] == bot_label_value]))
          print("Number of human users: ", len(new_df.loc[new_df['label'] == human_label_value]))

          return pd.DataFrame(new_df).copy()
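The row-by-row loop above can also be expressed as a vectorized groupby; a minimal sketch (the function name `filter_df_balanced` and the toy frame are illustrative, not from the notebook):

```python
import pandas as pd

def filter_df_balanced(df, label_col='label'):
    # Keep the first max_num rows of each class, where max_num is the
    # size of the smallest class - the same result as the row loop above.
    max_num = df[label_col].value_counts().min()
    return df.groupby(label_col).head(max_num).reset_index(drop=True)

# Tiny illustrative frame: 3 bots, 2 humans -> balanced to 2 of each.
demo = pd.DataFrame({'label': ['bot', 'human', 'bot', 'bot', 'human'],
                     'x': [0, 1, 2, 3, 4]})
balanced = filter_df_balanced(demo)
```

`groupby(...).head(n)` preserves the original row order, matching the loop's first-come selection.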
      [25]:
       
      # users_df = filter_df_for_balanced_classes(users_df2)
      users_df = pd.DataFrame(users_df2).copy()

      Data preparation¶

      [26]:
       
      def drop_columns(df, columns):
          for column_name in columns:
            df = df.drop([column_name], axis=1)
          return df
      [27]:
       
      def encode_not_numeric_columns(df):
        for column_name in df:
          if not is_numeric_dtype(df[column_name]):
            unique_values_dict = dict(enumerate(df[column_name].unique()))
            unique_values_dict = dict((v, k) for k, v in unique_values_dict.items())
            df[column_name] = df[column_name].map(unique_values_dict)
        return df
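The same first-seen integer encoding can be obtained with `pd.factorize`; a sketch that is equivalent under the assumption that the encoded columns contain no NaN (which `factorize` would code as -1, unlike the dict mapping above):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def encode_not_numeric_columns_factorized(df):
    # pd.factorize assigns integers in first-appearance order, matching
    # the dict(enumerate(df[col].unique())) mapping in the loop above.
    for column_name in df.columns:
        if not is_numeric_dtype(df[column_name]):
            df[column_name] = pd.factorize(df[column_name])[0]
    return df

# Illustrative frame: the string column is encoded, the numeric one untouched.
demo = pd.DataFrame({'label': ['human', 'bot', 'human'], 'n': [1, 2, 3]})
encoded = encode_not_numeric_columns_factorized(demo)
```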

      Align values for bool columns¶

      [28]:
       
      boolean_columns = ["verified", "protected", "withheld", "has_location", "has_profile_image_url", "has_pinned_tweet", "has_description"]
      [29]:
       
      # Firstly align boolean columns values
      for col_name in boolean_columns:
          users_df[col_name] = users_df[col_name].astype(bool)
      ​
      column_to_remove = []
      # Check unique values (a subset may have only one unique value for some feature); if so, the column will be removed from the dataframe
      for col_name in boolean_columns:
          uniq_val_list = users_df[col_name].unique()
          print("Column {:<24} {}".format(col_name, str(uniq_val_list)))
          if (len(uniq_val_list) < 2):
              column_to_remove.append(col_name)
      Column verified                 [False  True]
      Column protected                [False  True]
      Column withheld                 [False]
      Column has_location             [ True False]
      Column has_profile_image_url    [ True False]
      Column has_pinned_tweet         [False  True]
      Column has_description          [ True False]
      
      [30]:
       
      column_to_remove
      [30]:
      ['withheld']
      [31]:
       
      # remove from bool columns:
      for col_name in column_to_remove:
          boolean_columns.remove(col_name)
      # remove from dataframe
      users_df = drop_columns(users_df, column_to_remove)

      Encoding of non-numeric information which will be used by model¶

      [32]:
       
      # Remap the values of the dataframe
      for col_name in boolean_columns:
        users_df[col_name] = users_df[col_name].map({True:1,False:0})
      ​
      # Remap label values human/bot for 0/1
      label_col = "label"
      users_df[label_col] = users_df[label_col].map({"human":0,"bot":1})
      [33]:
       
      users_df
      [33]:
      id label username name created_at verified protected has_location location has_profile_image_url has_pinned_tweet url followers_count following_count tweet_count listed_count has_description description descr_no_hashtags descr_no_cashtags descr_no_mentions descr_no_urls url_no_urls
      0 1428769922507751429 1 BotoxAesthetics dermalfillers Aesthetics botox 1629480285 0 0 1 London , United Kingdom 1 0 https://t.co/CBDBvXnRKv 2 41 1 0 1 Enhance fillers is a progressive company found in the city of Webminster,We offer a wide range of aesthetic services including Botox, Dysport, Xeomin,the Juvede 0 0 0 0 1
      1 1484544053572419585 0 blessing_xettry #Blessing xettry 1642777877 0 0 1 Nepal 1 0 0 24 1 0 1 Okay, well, maybe not forever. But at least until you make some changes. 0 0 0 0 0
      2 842202106324951040 1 Mark11474609 Mark 1489631604 0 0 1 Kelvin Grove, Brisbane 1 0 3 22 4 0 0 0 0 0 0 0
      3 1447956502443069446 0 menametaken winwinnie 1634054741 0 0 1 your walls 1 0 0 20 1 0 1 20 | uni student | life goes brrr \nyes I do and it's called art\n#thickthighssavelifes 1 0 0 0 0
      4 21309002 1 Sjouzan Zuzana 1235058272 0 0 1 Brighton, UK 1 0 3 42 2 0 0 0 0 0 0 0
      ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
      9995 3275187061 0 LZconcussion Concussion Recovery 1436585642 0 0 1 Park City, UT 1 1 https://t.co/KpjP54TOGR 352 190 1094 18 1 Specializing in #concussionmanagement including #education, #therapy options / recommendations & our standardized #ReturntoLifeandSport #exerciseprogression 5 0 0 0 1
      9996 1485289449487572996 1 davie73smith Davie 1642955586 0 0 0 None 1 0 0 34 0 0 1 F U N 0 0 0 0 0
      9997 1215382704876871680 0 USC_TrueVote USC Election Cybersecurity Initiative 1578604840 0 0 1 washington dc 1 0 https://t.co/jlreKFwVEc 608 1265 1095 11 1 Platform, party, and vendor-agnostic.\nOur candidate is democracy. ����\nTraining in all 50 states ✈ �� ��\nSupport from @google\nUpcoming Training Events �� 0 0 1 0 1
      9998 1480725883820208131 1 Theresa823 Theresa Coleman 1641867569 0 0 0 None 1 0 0 5 0 0 0 0 0 0 0 0
      9999 407458156 0 ManyDullKnives Many Dull Knives 1320722300 0 0 1 Toronto, Canada. 1 1 http://t.co/JyuusxBRDF 864 346 153532 47 1 The official Twitter account of the webcomic, Many Dull Knives, by @jHYtse. (This is a humourous strip and not about a cowardly cutter) 0 0 1 0 1

      10000 rows × 23 columns


      Null and NaN statistics¶

      [34]:
       
      for col_name in users_df:
          count1 = pd.isnull(users_df[col_name]).sum()
          print(col_name + ": " + str(count1))
      id: 0
      label: 0
      username: 0
      name: 0
      created_at: 0
      verified: 0
      protected: 0
      has_location: 0
      location: 3476
      has_profile_image_url: 0
      has_pinned_tweet: 0
      url: 0
      followers_count: 0
      following_count: 0
      tweet_count: 0
      listed_count: 0
      has_description: 0
      description: 0
      descr_no_hashtags: 0
      descr_no_cashtags: 0
      descr_no_mentions: 0
      descr_no_urls: 0
      url_no_urls: 0
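The per-column loop above is equivalent to a single vectorized call; a minimal sketch on a toy frame (not the notebook's data):

```python
import pandas as pd

# isnull().sum() returns the per-column count of missing values in one
# pass, the same numbers the print loop above produces.
demo = pd.DataFrame({'location': ['Nepal', None, 'London'],
                     'label': [0, 1, 1]})
null_counts = demo.isnull().sum()
```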
      

      Extract some information from dataframe to new columns¶

      Description length¶
      [35]:
       
      users_df['descr_len'] = users_df['description'].apply(len).astype(float)
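An equivalent vectorized form uses the pandas string accessor; a sketch on illustrative strings (this matches `apply(len)` because, per the null statistics above, no descriptions are missing):

```python
import pandas as pd

# str.len() is the vectorized equivalent of description.apply(len)
# for a string column with no NaN values.
descriptions = pd.Series(['F U N', '', 'Okay'])
lengths = descriptions.str.len().astype(float)
```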
Account age in days, relative to 16.03.2022 (dataset collected during the 20/01-15/03/2022 period)¶
      [36]:
       
      from datetime import datetime
      [37]:
       
      def cal_days_diff(a,b):
          A = a.replace(hour = 0, minute = 0, second = 0, microsecond = 0)
          B = b.replace(hour = 0, minute = 0, second = 0, microsecond = 0)
          return (A - B).days
      ​
      def convert_unixtime_to_datetime(a):
          return datetime.utcfromtimestamp(a)
      [38]:
       
      base_date = datetime(2022, 3, 16)
      users_df['account_age'] = users_df.apply(lambda x: cal_days_diff(base_date, convert_unixtime_to_datetime(x.created_at)), axis=1).astype(float)
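A vectorized equivalent (a sketch; the row-wise `apply` above produces the same result) converts all timestamps at once with `pd.to_datetime`. The sample timestamps are taken from the table earlier in the notebook:

```python
import pandas as pd

# Sample Unix timestamps from the users table above.
df = pd.DataFrame({"created_at": [1641867569, 1320722300]})
base_date = pd.Timestamp(2022, 3, 16)
# Normalize to midnight so the day difference matches cal_days_diff above.
created = pd.to_datetime(df["created_at"], unit="s").dt.normalize()
df["account_age"] = (base_date - created).dt.days.astype(float)
print(df["account_age"].tolist())  # [64.0, 3781.0]
```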

      Reduce unnecessary columns¶

      [39]:
       
      # users_reduced_df = pd.DataFrame(users_df).copy()
      # # columns_to_drop = ["id", "username", "name", "created_at", "location", "url", "description"]
      # columns_to_drop = ["username", "name", "created_at", "location", "url", "description"]
      # users_reduced_df = drop_columns(users_reduced_df, columns_to_drop)
      # users_reduced_df

Filter data, keeping columns selected by feature importance from SHAP results¶

      [40]:
       
      shap_features = ['followers_count', 'tweet_count', 'following_count', 'account_age', 'descr_len']
      [41]:
       
      users_reduced_df = users_df.filter(['label', 'id'] + shap_features)
      users_reduced_df
      [41]:
      label id followers_count tweet_count following_count account_age descr_len
      0 1 1428769922507751429 2 1 41 208.0 160.0
      1 0 1484544053572419585 0 1 24 54.0 72.0
      2 1 842202106324951040 3 4 22 1826.0 0.0
      3 0 1447956502443069446 0 1 20 155.0 85.0
      4 1 21309002 3 2 42 4773.0 0.0
      ... ... ... ... ... ... ... ...
      9995 0 3275187061 352 1094 190 2440.0 156.0
      9996 1 1485289449487572996 0 0 34 52.0 5.0
      9997 0 1215382704876871680 608 1095 1265 797.0 153.0
      9998 1 1480725883820208131 0 0 5 64.0 0.0
      9999 0 407458156 864 153532 346 3781.0 135.0

      10000 rows × 7 columns


      Data type conversion (to float)¶

      [42]:
       
      for (column_name, column_data) in users_reduced_df.items():  # iteritems() was removed in pandas 2.0
          if (column_name != 'id'):
              users_reduced_df[column_name] = users_reduced_df[column_name].astype(float)

      Data split for training, validation and testing of users data¶

      [43]:
       
      train_users_data, test_users_data = train_test_split(users_reduced_df, test_size=0.30, random_state=25, shuffle=True)
      test_users_data, val_users_data = train_test_split(test_users_data, test_size=0.5, random_state=25, shuffle=True)
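The two-step split above yields a 70/15/15 partition: 30% is held out first, then split in half into test and validation. A quick check on a toy frame (same parameters as above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(100)})  # toy data
train, rest = train_test_split(df, test_size=0.30, random_state=25, shuffle=True)
test, val = train_test_split(rest, test_size=0.5, random_state=25, shuffle=True)
print(len(train), len(val), len(test))  # 70 15 15
```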

Describe the training subset of the users dataset¶

      [44]:
       
      train_users_data.describe()
      [44]:
      label followers_count tweet_count following_count account_age descr_len
      count 7000.000000 7.000000e+03 7.000000e+03 7000.000000 7000.000000 7000.000000
      mean 0.503857 6.229971e+03 6.554910e+03 1253.036286 2442.663000 84.592000
      std 0.500021 4.412925e+04 3.316229e+04 6121.951212 1640.465006 59.651674
      min 0.000000 0.000000e+00 0.000000e+00 0.000000 22.000000 0.000000
      25% 0.000000 3.300000e+01 2.200000e+01 74.000000 818.000000 23.000000
      50% 1.000000 2.710000e+02 5.015000e+02 269.000000 2407.000000 95.000000
      75% 1.000000 1.565500e+03 3.310500e+03 899.000000 3995.000000 143.000000
      max 1.000000 1.730667e+06 1.184641e+06 244195.000000 5724.000000 243.000000

Describe training users data for bots¶

      [45]:
       
      train_users_data.loc[train_users_data['label']==1].describe()
      [45]:
      label followers_count tweet_count following_count account_age descr_len
      count 3527.0 3527.000000 3527.000000 3527.000000 3527.000000 3527.000000
      mean 1.0 2016.999716 2185.104338 770.499008 2060.185143 67.609300
      std 0.0 19503.794857 11279.654017 4195.024713 1565.122289 62.244412
      min 1.0 0.000000 0.000000 0.000000 30.000000 0.000000
      25% 1.0 14.000000 7.000000 41.000000 604.000000 0.000000
      50% 1.0 81.000000 127.000000 140.000000 1776.000000 58.000000
      75% 1.0 410.000000 1086.000000 431.000000 3446.500000 134.000000
      max 1.0 702018.000000 497641.000000 150720.000000 5484.000000 243.000000

Describe training users data for humans¶

      [46]:
       
      train_users_data.loc[train_users_data['label']==0].describe()
      [46]:
      label followers_count tweet_count following_count account_age descr_len
      count 3473.0 3.473000e+03 3.473000e+03 3473.000000 3473.000000 3473.000000
      mean 0.0 1.050845e+04 1.099266e+04 1743.076303 2831.087820 101.838756
      std 0.0 5.918594e+04 4.526135e+04 7563.173323 1624.084521 51.457456
      min 0.0 0.000000e+00 0.000000e+00 0.000000 22.000000 0.000000
      25% 0.0 1.570000e+02 1.700000e+02 159.000000 1343.000000 63.000000
      50% 0.0 9.130000e+02 1.578000e+03 499.000000 3094.000000 115.000000
      75% 0.0 3.610000e+03 6.601000e+03 1413.000000 4342.000000 149.000000
      max 0.0 1.730667e+06 1.184641e+06 244195.000000 5724.000000 181.000000

      Data analysis¶


      Distribution of label class in training, validation and test set of users data¶

      [47]:
       
      stack_data = {'Set': ['Training data', 'Validation data', 'Test data', 'Training data', 'Validation data', 'Test data'],
                    'Label': ['Bot', 'Bot', 'Bot', 'Human', 'Human', 'Human'],
                    'Freq': [len(train_users_data.loc[train_users_data['label']==1]), 
                             len(val_users_data.loc[val_users_data['label']==1]), 
                             len(test_users_data.loc[test_users_data['label']==1]),
                             len(train_users_data.loc[train_users_data['label']==0]), 
                             len(val_users_data.loc[val_users_data['label']==0]), 
                             len(test_users_data.loc[test_users_data['label']==0])]}
      sdf = pd.DataFrame(stack_data)
      sdf
      [47]:
      Set Label Freq
      0 Training data Bot 3527
      1 Validation data Bot 743
      2 Test data Bot 730
      3 Training data Human 3473
      4 Validation data Human 757
      5 Test data Human 770
      [48]:
       
      fig = px.bar(sdf, x="Set", y="Freq",
                   color="Label", hover_data=['Label'],
                   barmode = 'group')
      fig.update_layout(
          title_text='Distribution of bot/human classes in training, validation and test dataset',
          xaxis_title_text='', #'subset',
          yaxis_title_text='frequency',
          bargap=0.05,
          bargroupgap=0.05,
          width=700,
          height=500,
          legend={"title":""})
      fig.show()
      [Figure: grouped bar chart of bot/human class frequencies in the training, validation and test sets]

      Distribution of other features in training dataset¶


      followers_count¶

      [49]:
       
      fig = go.Figure()
      fig = make_subplots(rows=1, cols=2, specs=[[{'type':'histogram'}, {'type':'histogram'}]])
      ​
      fig.add_trace(go.Histogram(
          x=train_users_data.loc[train_users_data['label']==1,'followers_count'],
          # histnorm='density',
          nbinsx=200,
          name='Bot'),
      row=1, col=1
      )
      fig.add_trace(go.Histogram(
          x=train_users_data.loc[train_users_data['label']==0,'followers_count'],
          # histnorm='density',
          nbinsx=200,
          name='Human'),
      row=1, col=2
      )
      ​
      fig.update_layout(
          title_text='Distribution of values of training dataset column: followers_count',
          xaxis_title_text='followers_count', #'feature',
          yaxis_title_text='frequency',
          bargap=0.5,
          bargroupgap=None, #0.8,
          width=1100,
          height=450,
          legend={"title":""},
          xaxis=dict(showgrid=True,title='followers_count', dtick=50000, range=[0, max(train_users_data.loc[train_users_data['label']==1,'followers_count'])+25000]),
          xaxis2=dict(showgrid=True, dtick=100000, range=[0, max(train_users_data.loc[train_users_data['label']==0,'followers_count'])+50000]),
          yaxis=dict(showgrid=True))
      ​
      fig.show()
      [Figure: histograms of followers_count in the training set; bots (left), humans (right)]
      [50]:
       
      len(train_users_data[(train_users_data['label']==1)])
      [50]:
      3527
      [51]:
       
      len(train_users_data[(train_users_data['label']==0)])
      [51]:
      3473
      [52]:
       
      from scipy.stats import expon
      ​
      # Fit an exponential distribution to data
      loc_b, scale_b = expon.fit(train_users_data.loc[train_users_data['label']==1]['followers_count'])
      loc_h, scale_h = expon.fit(train_users_data.loc[train_users_data['label']==0]['followers_count'])
      ​
      # Calculate the 99th percentile using the percent-point function (inverse CDF)
      percentile_99_bots = expon.ppf(0.99, loc=loc_b, scale=scale_b)
      percentile_99_humans = expon.ppf(0.99, loc=loc_h, scale=scale_h)
      df_reduced_outliers_followers_count = train_users_data[((train_users_data['label']==1) & (train_users_data['followers_count'] < percentile_99_bots)) | ((train_users_data['label']==0) & (train_users_data['followers_count'] < percentile_99_humans))]
      df_filtered_bots = train_users_data[(train_users_data['label']==1) & (train_users_data['followers_count'] < percentile_99_bots)]
      df_filtered_humans = train_users_data[(train_users_data['label']==0) & (train_users_data['followers_count'] < percentile_99_humans)]
      [53]:
       
      def df_99_percentile(df, column_name):
          # Fit an exponential distribution to data
          loc_b, scale_b = expon.fit(df.loc[df['label']==1][column_name])
          loc_h, scale_h = expon.fit(df.loc[df['label']==0][column_name])
      ​
          # Calculate the 99th percentile using the percent-point function (inverse CDF)
          percentile_99_bots = expon.ppf(0.99, loc=loc_b, scale=scale_b)
          percentile_99_humans = expon.ppf(0.99, loc=loc_h, scale=scale_h)
          return df[((df['label']==1) & (df[column_name] < percentile_99_bots)) | ((df['label']==0) & (df[column_name] < percentile_99_humans))]
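A quick sanity check of the trimming helper on synthetic data (the exponential scales below are arbitrary; the function is repeated so the sketch is self-contained):

```python
import numpy as np
import pandas as pd
from scipy.stats import expon

def df_99_percentile(df, column_name):
    # Same helper as above: trim each class at the 99th percentile
    # of a fitted exponential distribution.
    loc_b, scale_b = expon.fit(df.loc[df['label'] == 1][column_name])
    loc_h, scale_h = expon.fit(df.loc[df['label'] == 0][column_name])
    percentile_99_bots = expon.ppf(0.99, loc=loc_b, scale=scale_b)
    percentile_99_humans = expon.ppf(0.99, loc=loc_h, scale=scale_h)
    return df[((df['label'] == 1) & (df[column_name] < percentile_99_bots)) |
              ((df['label'] == 0) & (df[column_name] < percentile_99_humans))]

# Synthetic heavy-tailed counts: roughly 1% of each class should be trimmed.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "label": [1] * 500 + [0] * 500,
    "followers_count": np.concatenate([
        rng.exponential(scale=100, size=500),    # "bots"
        rng.exponential(scale=1000, size=500),   # "humans"
    ]),
})
trimmed = df_99_percentile(demo, "followers_count")
print(len(demo), len(trimmed))
```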

      following_count¶

      [54]:
       
      fig = go.Figure()
      fig = make_subplots(rows=1, cols=2, specs=[[{'type':'histogram'}, {'type':'histogram'}]])
      ​
      fig.add_trace(go.Histogram(
          x=train_users_data.loc[train_users_data['label']==1,'following_count'],
          # histnorm='density',
          nbinsx=200,
          name='Bot'),
      row=1, col=1
      )
      fig.add_trace(go.Histogram(
          x=train_users_data.loc[train_users_data['label']==0,'following_count'],
          # histnorm='density',
          nbinsx=200,
          name='Human'),
      row=1, col=2
      )
      ​
      fig.update_layout(
          title_text='Distribution of values of training dataset column: following_count',
          xaxis_title_text='following_count', #'feature',
          yaxis_title_text='frequency',
          bargap=0.5,
          bargroupgap=None, #0.8,
          width=1100,
          height=450,
          legend={"title":""},
          xaxis=dict(showgrid=True, dtick=10000, range=[0, max(train_users_data['following_count'])+5000]),
          xaxis2=dict(showgrid=True, dtick=10000, range=[0, max(train_users_data['following_count'])+5000]),
          yaxis=dict(showgrid=True))
      ​
      fig.show()
      [Figure: histograms of following_count in the training set; bots (left), humans (right)]

      tweet_count¶

      [55]:
       
      fig = go.Figure()
      fig = make_subplots(rows=1, cols=2, specs=[[{'type':'histogram'}, {'type':'histogram'}]])
      ​
      fig.add_trace(go.Histogram(
          x=train_users_data.loc[train_users_data['label']==1,'tweet_count'],
          # histnorm='density',
          nbinsx=200,
          name='Bot'),
      row=1, col=1
      )
      fig.add_trace(go.Histogram(
          x=train_users_data.loc[train_users_data['label']==0,'tweet_count'],
          # histnorm='density',
          nbinsx=200,
          name='Human'),
      row=1, col=2
      )
      ​
      fig.update_layout(
          title_text='Distribution of values of training dataset column: tweet_count',
          xaxis_title_text='tweet_count', #'feature',
          yaxis_title_text='frequency',
          bargap=0.5,
          bargroupgap=None, #0.8,
          width=1100,
          height=450,
          legend={"title":""},
          xaxis=dict(showgrid=True, dtick=20000),
          xaxis2=dict(showgrid=True, dtick=100000),
          yaxis=dict(showgrid=True))
      ​
      fig.show()
      [Figure: histograms of tweet_count in the training set; bots (left), humans (right)]

      descr_len¶

      [56]:
       
      fig = go.Figure()
      fig = make_subplots(rows=1, cols=2, specs=[[{'type':'histogram'}, {'type':'histogram'}]])
      ​
      fig.add_trace(go.Histogram(
          x=train_users_data.loc[train_users_data['label']==1,'descr_len'],
          # histnorm='density',
          # nbinsx=30,
          name='Bot'),
      row=1, col=1
      )
      fig.add_trace(go.Histogram(
          x=train_users_data.loc[train_users_data['label']==0,'descr_len'],
          # histnorm='density',
          # nbinsx=30,
          name='Human'),
      row=1, col=2
      )
      ​
      fig.update_layout(
          title_text='Distribution of values of training dataset column: descr_len',
          xaxis_title_text='descr_len', #'feature',
          yaxis_title_text='frequency',
          bargap=0.2,
          bargroupgap=None, #0.8,
          width=1100,
          height=350,
          legend={"title":""},
          xaxis=dict(showgrid=True, dtick=10, range=[0, max(train_users_data['descr_len'])+5]),
          xaxis2=dict(showgrid=True, dtick=10, range=[0, max(train_users_data['descr_len'])+5]),
          yaxis=dict(showgrid=True))
      ​
      fig.show()
      [Figure: histograms of descr_len in the training set; bots (left), humans (right)]

      account_age¶

      [57]:
       
      fig = go.Figure()
      fig = make_subplots(rows=1, cols=2, specs=[[{'type':'histogram'}, {'type':'histogram'}]])
      ​
      fig.add_trace(go.Histogram(
          x=train_users_data.loc[train_users_data['label']==1,'account_age'],
          # histnorm='density',
          # nbinsx=30,
          name='Bot'),
      row=1, col=1
      )
      fig.add_trace(go.Histogram(
          x=train_users_data.loc[train_users_data['label']==0,'account_age'],
          # histnorm='density',
          # nbinsx=30,
          name='Human'),
      row=1, col=2
      )
      ​
      fig.update_layout(
          title_text='Distribution of values of training dataset column: account_age',
          xaxis_title_text='account_age', #'feature',
          yaxis_title_text='frequency',
          bargap=0.2,
          bargroupgap=None, #0.8,
          width=1100,
          height=350,
          legend={"title":""},
          xaxis=dict(showgrid=True, dtick=500, range=[0, max(train_users_data['account_age'])+250]),
          xaxis2=dict(showgrid=True, dtick=500, range=[0, max(train_users_data['account_age'])+250]),
          yaxis=dict(showgrid=True))
      ​
      fig.show()
      [Figure: histograms of account_age in the training set; bots (left), humans (right)]
      [58]:
       
      len(train_users_data)
      [58]:
      7000

      Filter to have the same number of records for each class - part II¶

      [59]:
       
      train_users_data = filter_df_for_balanced_classes(train_users_data, bot_label_value=1, human_label_value=0)
      val_users_data = filter_df_for_balanced_classes(val_users_data, bot_label_value=1, human_label_value=0)
      test_users_data = filter_df_for_balanced_classes(test_users_data, bot_label_value=1, human_label_value=0)
      Number of bots:  3473
      Number of human users:  3473
      Number of bots:  743
      Number of human users:  743
      Number of bots:  730
      Number of human users:  730
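`filter_df_for_balanced_classes` is defined earlier in the notebook and not shown in this excerpt. A hypothetical reconstruction consistent with the printed output (equal class counts via downsampling the majority class; the `random_state` is an assumption):

```python
import pandas as pd

def filter_df_for_balanced_classes(df, bot_label_value=1, human_label_value=0):
    # Hypothetical reconstruction: downsample the majority class so both
    # classes end up with the same number of records.
    bots = df[df['label'] == bot_label_value]
    humans = df[df['label'] == human_label_value]
    n = min(len(bots), len(humans))
    balanced = pd.concat([bots.sample(n=n, random_state=25),
                          humans.sample(n=n, random_state=25)])
    print("Number of bots: ", n)
    print("Number of human users: ", n)
    return balanced

demo = pd.DataFrame({"label": [1] * 5 + [0] * 3, "x": range(8)})
balanced = filter_df_for_balanced_classes(demo)
print(len(balanced))  # 6
```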
      

First drop columns whose values are constant throughout the training dataset¶

      [60]:
       
      same_data_columns = list(train_users_data.columns[train_users_data.apply(lambda x: x.nunique()) == 1])
      same_data_columns
      [60]:
      []
      [61]:
       
      train_users_data = train_users_data.drop(same_data_columns, axis=1)
      ​
      val_users_data = val_users_data.drop(same_data_columns, axis=1)
      test_users_data = test_users_data.drop(same_data_columns, axis=1)

Standardize data using the training set's column mean and standard deviation¶

      [62]:
       
      def standardize_column(df, col_name, mean_training, std_training):
          df_cp = df.copy()
          df_cp[col_name] = (df[col_name] - mean_training) / std_training
      ​
          return df_cp
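A small check of the helper (repeated here so the sketch is self-contained): the training column becomes zero-mean / unit-std, while validation data is shifted with the *training* statistics rather than re-fitted.

```python
import pandas as pd

def standardize_column(df, col_name, mean_training, std_training):
    # Same helper as above.
    df_cp = df.copy()
    df_cp[col_name] = (df[col_name] - mean_training) / std_training
    return df_cp

train = pd.DataFrame({"x": [0.0, 10.0, 20.0]})
val = pd.DataFrame({"x": [10.0, 30.0]})
m, s = train["x"].mean(), train["x"].std()   # 10.0, 10.0 (sample std)
train_std = standardize_column(train, "x", m, s)
val_std = standardize_column(val, "x", m, s)
print(train_std["x"].tolist())  # [-1.0, 0.0, 1.0]
print(val_std["x"].tolist())    # [0.0, 2.0]
```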
      [63]:
       
      # columns_to_standardize = ['followers_count', 'following_count', 'tweet_count', 'descr_len', 'account_age']
      columns_to_standardize = list(train_users_data.columns)
      columns_to_standardize.remove('label')
      columns_to_standardize.remove('id')
      [64]:
       
      for column_name in columns_to_standardize:
          mean_training = train_users_data[column_name].mean()
          std_training = train_users_data[column_name].std()
          print(column_name)
          print("mean_training = ", mean_training)
          print("std_training = ", std_training)
          print()
      ​
          train_users_data = standardize_column(train_users_data, column_name, mean_training, std_training)
          val_users_data = standardize_column(val_users_data, column_name, mean_training, std_training)
          test_users_data = standardize_column(test_users_data, column_name, mean_training, std_training)
      followers_count
      mean_training =  6208.796717535272
      std_training =  44145.842669630874
      
      tweet_count
      mean_training =  6595.194212496401
      std_training =  33286.69197462025
      
      following_count
      mean_training =  1259.0381514540743
      std_training =  6144.786564521689
      
      account_age
      mean_training =  2443.6521739130435
      std_training =  1640.571178505393
      
      descr_len
      mean_training =  84.77094730780306
      std_training =  59.6092442340293
      
      

      Correlation¶

      [65]:
       
      sns.set(font_scale=2)
      [66]:
       
      corr_threshold = 0.52
      corr = train_users_data.drop(['id'], axis=1).corr()
      lower_tri = corr.where(np.tril(np.ones(corr.shape),k=-1).astype(bool)) #creating lower triangular correlation matrix
      f = plt.figure(figsize=(20, 15))
      sns.heatmap(lower_tri, cmap="PiYG", annot=True, vmin=-1, vmax=1, ax=plt.gca()) #, annot_kws={"fontsize": 16})
      high_corr = []
      for column in train_users_data:
          if (column != 'id'):
              for col in train_users_data:
                  if (col != 'id'):
                      if abs(lower_tri[column][col]) > corr_threshold:
                          high_corr.append((column, col, lower_tri[column][col]))
      high_corr = sorted(high_corr, key=lambda x: x[2], reverse=True)
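The nested loops above can equivalently be replaced by stacking the masked lower triangle and filtering; a sketch on a toy frame (columns `a`, `b`, `c` are illustrative; `b` is perfectly correlated with `a`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                   "b": [2.0, 4.0, 6.0, 8.0],    # b = 2a, correlation 1.0
                   "c": [1.0, -1.0, 1.0, -1.0]})
corr = df.corr()
# Keep the lower triangle only, as in the heatmap cell above.
lower_tri = corr.where(np.tril(np.ones(corr.shape), k=-1).astype(bool))
pairs = lower_tri.stack()            # drops the NaN upper triangle
high = pairs[pairs.abs() > 0.52]     # same threshold as above
print(high.index.tolist())           # [('b', 'a')]
```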
      [67]:
       
      sns.set(font_scale=1)
      [68]:
       
      print("Number of columns containing high correlation:", len(set([x[0] for x in high_corr])))
      high_corr
      Number of columns containing high correlation: 0
      
      [68]:
      []
      [69]:
       
      # train_users_data = train_users_data.drop(['listed_count'], axis=1)
      # val_users_data = val_users_data.drop(['listed_count'], axis=1)
      # test_users_data = test_users_data.drop(['listed_count'], axis=1)
      ​
      # train_users_data = train_users_data.drop(['has_description'], axis=1)
      # val_users_data = val_users_data.drop(['has_description'], axis=1)
      # test_users_data = test_users_data.drop(['has_description'], axis=1)
      [70]:
       
      train_users_data
      [70]:
      label id followers_count tweet_count following_count account_age descr_len
      6625 0.0 1214018601683836928 -0.128094 -0.192545 -0.113436 -1.001878 1.043950
      2489 0.0 109927809 -0.078032 -0.044047 0.038726 1.209547 1.111053
      9919 0.0 2325624539 -0.092643 -0.120715 0.608965 0.315346 1.262037
      6964 1.0 1362188147250061315 -0.140643 -0.198103 -0.196107 -1.250572 -1.422111
      3467 0.0 1105810614935531521 -0.133711 -0.059549 -0.178369 -0.819624 0.993622
      ... ... ... ... ... ... ... ...
      3325 0.0 1059189764 -0.128705 -0.129968 0.053210 0.557335 1.262037
      1881 0.0 235022253 -0.116065 -0.046150 0.438414 1.001083 1.060726
      4861 0.0 3053537383 -0.140371 -0.198013 -0.189598 0.078234 -0.616867
      1175 0.0 4348813577 -0.139805 -0.197232 -0.155423 -0.090000 -1.422111
      8447 0.0 912359571947085824 3.903521 -0.173889 -0.135731 -0.494128 0.909742

      6946 rows × 7 columns


Split users data into input (X) and output (Y)¶

      [71]:
       
      train_users_data_X = train_users_data.drop(['label'], axis=1)
      train_users_data_Y = pd.concat([train_users_data['label']], axis=1)
      val_users_data_X = val_users_data.drop(['label'], axis=1)
      val_users_data_Y = pd.concat([val_users_data['label']], axis=1)
      test_users_data_X = test_users_data.drop(['label'], axis=1)
      test_users_data_Y = pd.concat([test_users_data['label']], axis=1)
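Note that `pd.concat([df['label']], axis=1)` above is equivalent to selecting the column with a list, which keeps it a one-column DataFrame rather than a Series (toy data below is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"label": [0.0, 1.0], "id": ["a", "b"], "f": [1.0, 2.0]})
X = df.drop(["label"], axis=1)
Y = df[["label"]]  # one-column DataFrame, not a Series
print(Y.equals(pd.concat([df["label"]], axis=1)))  # True
```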

      Tweets data¶



      Loading data¶

Load users data to retrieve the label to attach to each tweet¶
      [72]:
       
      dataset_name = "twitbot_22_preprocessed_common_users_ids"
      users_table_name = "users"
      BQ_TABLE_USERS = dataset_name + "." + users_table_name
      users_table_id = project_id + "." + BQ_TABLE_USERS
      [73]:
       
      SQL_QUERY = f"""WITH 
        human_records AS (
          SELECT *, ROW_NUMBER() OVER () row_num 
          FROM {BQ_TABLE_USERS}
          WHERE label = 'human' 
          LIMIT 5000),
        bot_records AS (
        SELECT *, ROW_NUMBER() OVER () row_num 
          FROM {BQ_TABLE_USERS}
          WHERE label = 'bot' 
          LIMIT 5000)
        SELECT * FROM human_records 
          UNION ALL SELECT * 
          FROM bot_records 
          ORDER BY row_num;"""
      ​
      users_df1 = bqclient.query(SQL_QUERY).to_dataframe()
      users_df1 = users_df1.drop(['row_num'], axis=1)
      Load tweets data¶
      [74]:
       
      dataset_name = "twitbot_22_preprocessed_common_users_ids"
      tweets_table_name = "tweets"
      BQ_TABLE_TWEETS = dataset_name + "." + tweets_table_name
      tweets_table_id = project_id + "." + BQ_TABLE_TWEETS
      [75]:
       
      # comma-separated string of user IDs from users dataframe
      users_df0 = pd.DataFrame(users_df1).copy()
      users_df0['id'] = users_df0['id'].astype(str)
      user_ids = users_df0['id'].to_list()
      ​
      # SQL query to select records from the 'tweets' table
      SQL_QUERY = f"""SELECT * FROM {BQ_TABLE_TWEETS} WHERE CAST(author_id AS STRING) IN ({str(user_ids)[1:-1]})"""
      ​
      tweets_df1 = bqclient.query(SQL_QUERY).to_dataframe()
      [76]:
       
      # LIMIT RESULTS OPTIONS
      pd.set_option('display.max_rows', 100)
      # pd.set_option('display.max_rows', None)
      pd.set_option('display.max_columns', None)
      pd.set_option('display.max_colwidth', None)
      [77]:
       
      len(tweets_df1)
      [77]:
      426163

Append author label (bot/human) to the tweets dataset¶

      [78]:
       
      user_id_label_dict = users_df1.set_index('id')['label'].to_dict()
      tweets_df1['author_label'] = tweets_df1['author_id'].map(user_id_label_dict)
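A toy version of the lookup above: build an `id -> label` dict from the users table and map it onto the tweets' `author_id` column (ids and labels below are illustrative):

```python
import pandas as pd

users = pd.DataFrame({"id": [1, 2], "label": ["bot", "human"]})
tweets = pd.DataFrame({"author_id": [2, 1, 1]})
lookup = users.set_index("id")["label"].to_dict()   # {1: 'bot', 2: 'human'}
tweets["author_label"] = tweets["author_id"].map(lookup)
print(tweets["author_label"].tolist())  # ['human', 'bot', 'bot']
```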
      [79]:
       
      # tweets_df1
      [80]:
       
      org_tweet_df = pd.DataFrame(tweets_df1).copy()
      tweets_df = pd.DataFrame(org_tweet_df).copy()
      [81]:
       
      tweets_df.columns
      [81]:
      Index(['id', 'author_id', 'created_at', 'org_text', 'text', 'source',
             'withheld', 'copyright_infringement', 'is_reply', 'geo_tagged',
             'latitude', 'longitude', 'conversation_id', 'reply_settings',
             'retweet_count', 'reply_count', 'like_count', 'quote_count',
             'any_polls_attached', 'any_media_attached', 'possibly_sensitive',
             'has_referenced_tweets', 'media_attached', 'no_cashtags', 'no_mentions',
             'no_user_mentions', 'user_mentions', 'no_urls', 'contains_images',
             'contains_annotations', 'no_hashtags', 'hashtags',
             'context_annotations_domain_id', 'context_annotations_domain_name',
             'context_annotations_entity_id', 'context_annotations_entity_name',
             'author_label'],
            dtype='object')
      [82]:
       
      len(tweets_df)
      [82]:
      426163
      [83]:
       
      tweets_df
      [83]:
      id author_id created_at org_text text source withheld copyright_infringement is_reply geo_tagged latitude longitude conversation_id reply_settings retweet_count reply_count like_count quote_count any_polls_attached any_media_attached possibly_sensitive has_referenced_tweets media_attached no_cashtags no_mentions no_user_mentions user_mentions no_urls contains_images contains_annotations no_hashtags hashtags context_annotations_domain_id context_annotations_domain_name context_annotations_entity_id context_annotations_entity_name author_label
      0 t1485719054551855120 50338306 1643058000 North Carolina Governor Roy Cooper on Friday sought federal help in the Charlotte area as hospitals across the state face record numbers of patients hospitalized with COVID-19. https://t.co/1CZNI3pPza north carolina governor roy cooper on friday sought federal help in the charlotte area as hospitals across the state face record numbers of patients hospitalized with covid Sked Social False False False False NaN NaN 1485719054551855120 None 0 0 0 0 False False False False False 0 0 0 [] 1 False False 0 [] <NA> None <NA> None human
      1 t1466853630691377152 50338306 1638560133 On this visit to St. Louis, I wanted to focus on Union Station, a magnificent property located in the heart of the city which features an aquarium, numerous dining options, a Ferris wheel and, of course, the Union Station Hotel.  https://t.co/HlDeMqA2n8 on this visit to st louis i wanted to focus on union station a magnificent property located in the heart of the city which features an aquarium numerous dining options a ferris wheel and of course the union station hotel Sked Social False False False False NaN NaN 1466853630691377152 None 0 0 0 0 False False False False False 0 0 0 [] 1 False False 0 [] <NA> None <NA> None human
      2 t1446494364792987682 50338306 1633706105 Alice Dunbar-Nelson was a racially-mixed bisexual poet and author whose career spanned multiple literary genres and culminated during the Harlem Renaissance. She was also a lifelong educator and activist who fought for women’s suffrage and equality  https://t.co/WqsxXTZKqM https://t.co/XQevKfSRyU alice dunbarnelson was a raciallymixed bisexual poet and author whose career spanned multiple literary genres and culminated during the harlem renaissance she was also a lifelong educator and activist who fought for womens suffrage and equality Sked Social False False False False NaN NaN 1446494364792987682 None 0 0 0 0 False False False False False 0 0 0 [] 1 False False 0 [] <NA> None <NA> None human
      3 t1486413634813276165 50338306 1643223601 The new year is finally here. We are headed into the third year of a devastating pandemic in a nation that seems more divided than ever. https://t.co/EiWrMvSCPA the new year is finally here we are headed into the third year of a devastating pandemic in a nation that seems more divided than ever Sked Social False False False False NaN NaN 1486413634813276165 None 0 0 0 0 False False False False False 0 0 0 [] 1 False False 0 [] <NA> None <NA> None human
      4 t1471189609078050821 106526969 1639593910 The warning letter highlights shortfalls in risk assessment, corrective and preventive action, complaint handling, device recalls and adverse event reporting at the Northridge, California, headquarters of Medtronic's diabetes segment. https://t.co/fMRpEWJPWC the warning letter highlights shortfalls in risk assessment corrective and preventive action complaint handling device recalls and adverse event reporting at the northridge california headquarters of medtronics diabetes segment Twitter Web App False False False False NaN NaN 1471189609078050821 None 1 0 0 0 False False False False False 0 0 0 [] 1 False False 0 [] <NA> None <NA> None human
      ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
      426158 t1486308593049780232 1442657001440489480 1643198557 What if you got enough of the small points, that you didn't have to ace the midterm or the final. What's your secret studying hack?\n#beginagainandwin #thinkdifferent #motivation #help #passtheclass #learn #theblackmancan #B1 #youareenough #positivity #timemanagement #repetition�� https://t.co/EnTeJyo7jm what if you got enough of the small points that you didnt have to ace the midterm or the final whats your secret studying hack beginagainandwin thinkdifferent motivation help passtheclass learn theblackmancan b youareenough positivity timemanagement repetition PromoRepublic False False False False NaN NaN 1486308593049780232 None 1 0 1 0 False False False False False 0 0 0 [] 0 False False 12 [{'tagname': 'beginagainandwin'}, {'tagname': 'thinkdifferent'}, {'tagname': 'motivation'}, {'tagname': 'help'}, {'tagname': 'passtheclass'}, {'tagname': 'learn'}, {'tagname': 'theblackmancan'}, {'tagname': 'B1'}, {'tagname': 'youareenough'}, {'tagname': 'positivity'}, {'tagname': 'timemanagement'}, {'tagname': 'repetition'}] <NA> None <NA> None bot
      426159 t1495030852715118594 1447249447424036866 1645278106 SCOPE OF CLOUD STORAGE \n\n#Mcoin aims to disrupt the cloud storage industry by eliminating the current industry barriers. M-coin network is building the most reliable #decentralized cloud data storage that ensures low costs and immutable security for the user’s data.\n\n#crypto https://t.co/XFshg0SVeY scope of cloud storage mcoin aims to disrupt the cloud storage industry by eliminating the current industry barriers mcoin network is building the most reliable decentralized cloud data storage that ensures low costs and immutable security for the users data crypto Twitter Web App False False False False NaN NaN 1495030852715118594 None 1 0 1 0 False False False False False 0 0 0 [] 0 False False 3 [{'tagname': 'Mcoin'}, {'tagname': 'decentralized'}, {'tagname': 'crypto'}] <NA> None <NA> None bot
      426160 t1495671246339850240 1468895392699846656 1645430788 Everything in life is somewhere else, and you get there in a car.\n“The road goes on forever and the party never ends.”\nLive your life by a Compass not a clock\n#Rentalcars #Cab #self #driving #longdrive #safe #mydriver #\n��https://t.co/qYwHNQ8eQI\n��8886399949 https://t.co/dryhxD1XJI everything in life is somewhere else and you get there in a car the road goes on forever and the party never ends live your life by a compass not a clock rentalcars cab self driving longdrive safe mydriver Twitter Web App False False False False NaN NaN 1495671246339850240 None 0 0 0 0 False False False False False 0 0 0 [] 1 False False 7 [{'tagname': 'Rentalcars'}, {'tagname': 'Cab'}, {'tagname': 'self'}, {'tagname': 'driving'}, {'tagname': 'longdrive'}, {'tagname': 'safe'}, {'tagname': 'mydriver'}] <NA> None <NA> None bot
      426161 t1488487987289731074 1469339407345934338 1643718165 Cybersecurity and privacy must be priorities for nonprofits—the risk of ignoring increased cyberthreats is too great. Access this e-book to find out how you can use #Microsoft 365 to build agile security frameworks in your organization. https://t.co/jhMUXYhcNU cybersecurity and privacy must be priorities for nonprofitsthe risk of ignoring increased cyberthreats is too great access this ebook to find out how you can use microsoft to build agile security frameworks in your organization ContentMX False False False False NaN NaN 1488487987289731074 None 0 0 0 0 False False False False False 0 0 0 [] 1 False False 1 [{'tagname': 'Microsoft'}] <NA> None <NA> None human
      426162 t1491185930106978304 1477407428950036485 1644361405 OK #100Devs I need webcam recommendations. Doesn't have to be cheap, but I don't want to pay $500 either. \n\nRight now I am looking at the Logitech C920x HD Pro Webcam. Open to all suggestions. ok devs i need webcam recommendations doesnt have to be cheap but i dont want to pay either right now i am looking at the logitech cx hd pro webcam open to all suggestions TweetDeck False False False False NaN NaN 1491185930106978304 None 0 0 2 0 False False False False False 0 0 0 [] 0 False False 1 [{'tagname': '100Devs'}] <NA> None <NA> None bot

      426163 rows × 37 columns


      Data preparation¶

      [84]:
       
      def drop_columns(df, columns):
          for column_name in columns:
              if column_name in df.columns:
                  df = df.drop([column_name], axis=1)
          return df
      [85]:
       
      from pandas.api.types import is_numeric_dtype
      
      def encode_not_numeric_columns(df):
          for column_name in df:
              if not is_numeric_dtype(df[column_name]):
                  unique_values_dict = dict(enumerate(df[column_name].unique()))
                  unique_values_dict = dict((v, k) for k, v in unique_values_dict.items())
                  df[column_name] = df[column_name].map(unique_values_dict)
          return df
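As a sanity check on how `encode_not_numeric_columns` behaves, here is a minimal self-contained sketch (the toy dataframe and its column names are hypothetical, not taken from the dataset): each unique string is replaced by the position at which it first appears, while numeric columns pass through unchanged.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def encode_not_numeric_columns(df):
    # Map each unique value of every non-numeric column to the
    # integer position at which it first appears in that column.
    for column_name in df:
        if not is_numeric_dtype(df[column_name]):
            unique_values_dict = {v: k for k, v in enumerate(df[column_name].unique())}
            df[column_name] = df[column_name].map(unique_values_dict)
    return df

# Toy example: 'source' is categorical, 'likes' is already numeric.
demo = pd.DataFrame({'source': ['Web', 'App', 'Web'], 'likes': [3, 1, 0]})
encoded = encode_not_numeric_columns(demo)
# 'Web' appears first -> 0, 'App' second -> 1; 'likes' is untouched.
```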

      Null and NaN statistics¶

      [86]:
       
      for col_name in tweets_df:
          count1 = pd.isnull(tweets_df[col_name]).sum()
          print(col_name + ": " + str(count1))
      id: 0
      author_id: 0
      created_at: 0
      org_text: 0
      text: 0
      source: 0
      withheld: 0
      copyright_infringement: 0
      is_reply: 0
      geo_tagged: 0
      latitude: 422884
      longitude: 422884
      conversation_id: 0
      reply_settings: 420428
      retweet_count: 0
      reply_count: 0
      like_count: 0
      quote_count: 0
      any_polls_attached: 0
      any_media_attached: 0
      possibly_sensitive: 0
      has_referenced_tweets: 0
      media_attached: 0
      no_cashtags: 0
      no_mentions: 0
      no_user_mentions: 0
      user_mentions: 0
      no_urls: 0
      contains_images: 0
      contains_annotations: 0
      no_hashtags: 0
      hashtags: 0
      context_annotations_domain_id: 426163
      context_annotations_domain_name: 426163
      context_annotations_entity_id: 426163
      context_annotations_entity_name: 426163
      author_label: 0
      

      reply_settings¶


      The Twitter API documentation for this field mentions that, if the field isn't specified, it defaults to everyone.

      [88]:
       
      set(tweets_df.loc[tweets_df['reply_settings'].notna()]['reply_settings'])
      [88]:
      {'everyone', 'following', 'mentionedUsers'}
      [89]:
       
      set(tweets_df['reply_settings'])
      [89]:
      {None, 'everyone', 'following', 'mentionedUsers'}

      Replace unspecified values (None) with 'everyone'¶

      [90]:
       
      tweets_df['reply_settings'] = tweets_df['reply_settings'].fillna('everyone')

      Remove columns with the most missing values¶

      [91]:
       
      most_nan_columns = ['context_annotations_domain_id',
                          'context_annotations_domain_name',
                          'context_annotations_entity_id',
                          'context_annotations_entity_name',
                          'latitude',
                          'longitude']
      tweets_df = drop_columns(tweets_df, most_nan_columns)

      Encode non-numeric information that will be used by the model¶


      Encode boolean columns¶

      [92]:
       
      boolean_columns  = ['withheld',
                          'copyright_infringement',
                          'is_reply',
                          'geo_tagged',
                          'any_polls_attached',
                          'any_media_attached',
                          'possibly_sensitive',
                          'has_referenced_tweets',
                          'media_attached',
                          'contains_images',
                          'contains_annotations']
      [93]:
       
      # Remap the values of the dataframe
      for col_name in boolean_columns:
          tweets_df[col_name] = tweets_df[col_name].map({True:1,False:0})
          
      # Remap label values human/bot for 0/1
      label_col = "author_label"
      tweets_df[label_col] = tweets_df[label_col].map({"human":0,"bot":1})

      Encode reply_settings categorical column¶

      [94]:
       
      reply_settings_dict = {'everyone' : 0, 
                             'following' : 1, 
                             'mentionedUsers' : 2}
      [95]:
       
      tweets_df['reply_settings'] = tweets_df['reply_settings'].map(reply_settings_dict)
      [96]:
       
      set(tweets_df['reply_settings'])
      [96]:
      {0, 1, 2}
      [97]:
       
      tweets_df
      [97]:
      id author_id created_at org_text text source withheld copyright_infringement is_reply geo_tagged conversation_id reply_settings retweet_count reply_count like_count quote_count any_polls_attached any_media_attached possibly_sensitive has_referenced_tweets media_attached no_cashtags no_mentions no_user_mentions user_mentions no_urls contains_images contains_annotations no_hashtags hashtags author_label
      0 t1485719054551855120 50338306 1643058000 North Carolina Governor Roy Cooper on Friday sought federal help in the Charlotte area as hospitals across the state face record numbers of patients hospitalized with COVID-19. https://t.co/1CZNI3pPza north carolina governor roy cooper on friday sought federal help in the charlotte area as hospitals across the state face record numbers of patients hospitalized with covid Sked Social 0 0 0 0 1485719054551855120 0 0 0 0 0 0 0 0 0 0 0 0 0 [] 1 0 0 0 [] 0
      1 t1466853630691377152 50338306 1638560133 On this visit to St. Louis, I wanted to focus on Union Station, a magnificent property located in the heart of the city which features an aquarium, numerous dining options, a Ferris wheel and, of course, the Union Station Hotel.  https://t.co/HlDeMqA2n8 on this visit to st louis i wanted to focus on union station a magnificent property located in the heart of the city which features an aquarium numerous dining options a ferris wheel and of course the union station hotel Sked Social 0 0 0 0 1466853630691377152 0 0 0 0 0 0 0 0 0 0 0 0 0 [] 1 0 0 0 [] 0
      2 t1446494364792987682 50338306 1633706105 Alice Dunbar-Nelson was a racially-mixed bisexual poet and author whose career spanned multiple literary genres and culminated during the Harlem Renaissance. She was also a lifelong educator and activist who fought for women’s suffrage and equality  https://t.co/WqsxXTZKqM https://t.co/XQevKfSRyU alice dunbarnelson was a raciallymixed bisexual poet and author whose career spanned multiple literary genres and culminated during the harlem renaissance she was also a lifelong educator and activist who fought for womens suffrage and equality Sked Social 0 0 0 0 1446494364792987682 0 0 0 0 0 0 0 0 0 0 0 0 0 [] 1 0 0 0 [] 0
      3 t1486413634813276165 50338306 1643223601 The new year is finally here. We are headed into the third year of a devastating pandemic in a nation that seems more divided than ever. https://t.co/EiWrMvSCPA the new year is finally here we are headed into the third year of a devastating pandemic in a nation that seems more divided than ever Sked Social 0 0 0 0 1486413634813276165 0 0 0 0 0 0 0 0 0 0 0 0 0 [] 1 0 0 0 [] 0
      4 t1471189609078050821 106526969 1639593910 The warning letter highlights shortfalls in risk assessment, corrective and preventive action, complaint handling, device recalls and adverse event reporting at the Northridge, California, headquarters of Medtronic's diabetes segment. https://t.co/fMRpEWJPWC the warning letter highlights shortfalls in risk assessment corrective and preventive action complaint handling device recalls and adverse event reporting at the northridge california headquarters of medtronics diabetes segment Twitter Web App 0 0 0 0 1471189609078050821 0 1 0 0 0 0 0 0 0 0 0 0 0 [] 1 0 0 0 [] 0
      ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
      426158 t1486308593049780232 1442657001440489480 1643198557 What if you got enough of the small points, that you didn't have to ace the midterm or the final. What's your secret studying hack?\n#beginagainandwin #thinkdifferent #motivation #help #passtheclass #learn #theblackmancan #B1 #youareenough #positivity #timemanagement #repetition�� https://t.co/EnTeJyo7jm what if you got enough of the small points that you didnt have to ace the midterm or the final whats your secret studying hack beginagainandwin thinkdifferent motivation help passtheclass learn theblackmancan b youareenough positivity timemanagement repetition PromoRepublic 0 0 0 0 1486308593049780232 0 1 0 1 0 0 0 0 0 0 0 0 0 [] 0 0 0 12 [{'tagname': 'beginagainandwin'}, {'tagname': 'thinkdifferent'}, {'tagname': 'motivation'}, {'tagname': 'help'}, {'tagname': 'passtheclass'}, {'tagname': 'learn'}, {'tagname': 'theblackmancan'}, {'tagname': 'B1'}, {'tagname': 'youareenough'}, {'tagname': 'positivity'}, {'tagname': 'timemanagement'}, {'tagname': 'repetition'}] 1
      426159 t1495030852715118594 1447249447424036866 1645278106 SCOPE OF CLOUD STORAGE \n\n#Mcoin aims to disrupt the cloud storage industry by eliminating the current industry barriers. M-coin network is building the most reliable #decentralized cloud data storage that ensures low costs and immutable security for the user’s data.\n\n#crypto https://t.co/XFshg0SVeY scope of cloud storage mcoin aims to disrupt the cloud storage industry by eliminating the current industry barriers mcoin network is building the most reliable decentralized cloud data storage that ensures low costs and immutable security for the users data crypto Twitter Web App 0 0 0 0 1495030852715118594 0 1 0 1 0 0 0 0 0 0 0 0 0 [] 0 0 0 3 [{'tagname': 'Mcoin'}, {'tagname': 'decentralized'}, {'tagname': 'crypto'}] 1
      426160 t1495671246339850240 1468895392699846656 1645430788 Everything in life is somewhere else, and you get there in a car.\n“The road goes on forever and the party never ends.”\nLive your life by a Compass not a clock\n#Rentalcars #Cab #self #driving #longdrive #safe #mydriver #\n��https://t.co/qYwHNQ8eQI\n��8886399949 https://t.co/dryhxD1XJI everything in life is somewhere else and you get there in a car the road goes on forever and the party never ends live your life by a compass not a clock rentalcars cab self driving longdrive safe mydriver Twitter Web App 0 0 0 0 1495671246339850240 0 0 0 0 0 0 0 0 0 0 0 0 0 [] 1 0 0 7 [{'tagname': 'Rentalcars'}, {'tagname': 'Cab'}, {'tagname': 'self'}, {'tagname': 'driving'}, {'tagname': 'longdrive'}, {'tagname': 'safe'}, {'tagname': 'mydriver'}] 1
      426161 t1488487987289731074 1469339407345934338 1643718165 Cybersecurity and privacy must be priorities for nonprofits—the risk of ignoring increased cyberthreats is too great. Access this e-book to find out how you can use #Microsoft 365 to build agile security frameworks in your organization. https://t.co/jhMUXYhcNU cybersecurity and privacy must be priorities for nonprofitsthe risk of ignoring increased cyberthreats is too great access this ebook to find out how you can use microsoft to build agile security frameworks in your organization ContentMX 0 0 0 0 1488487987289731074 0 0 0 0 0 0 0 0 0 0 0 0 0 [] 1 0 0 1 [{'tagname': 'Microsoft'}] 0
      426162 t1491185930106978304 1477407428950036485 1644361405 OK #100Devs I need webcam recommendations. Doesn't have to be cheap, but I don't want to pay $500 either. \n\nRight now I am looking at the Logitech C920x HD Pro Webcam. Open to all suggestions. ok devs i need webcam recommendations doesnt have to be cheap but i dont want to pay either right now i am looking at the logitech cx hd pro webcam open to all suggestions TweetDeck 0 0 0 0 1491185930106978304 0 0 0 2 0 0 0 0 0 0 0 0 0 [] 0 0 0 1 [{'tagname': '100Devs'}] 1

      426163 rows × 31 columns


      Extract additional features from the dataframe into new columns¶


      Tweet length¶

      [98]:
       
      tweets_df['cleaned_tweet_len'] = tweets_df['text'].apply(len).astype(float)
      tweets_df['org_tweet_len'] = tweets_df['org_text'].apply(len).astype(float)
      [99]:
       
      tweets_df
      [99]:
      id author_id created_at org_text text source withheld copyright_infringement is_reply geo_tagged conversation_id reply_settings retweet_count reply_count like_count quote_count any_polls_attached any_media_attached possibly_sensitive has_referenced_tweets media_attached no_cashtags no_mentions no_user_mentions user_mentions no_urls contains_images contains_annotations no_hashtags hashtags author_label cleaned_tweet_len org_tweet_len
      0 t1485719054551855120 50338306 1643058000 North Carolina Governor Roy Cooper on Friday sought federal help in the Charlotte area as hospitals across the state face record numbers of patients hospitalized with COVID-19. https://t.co/1CZNI3pPza north carolina governor roy cooper on friday sought federal help in the charlotte area as hospitals across the state face record numbers of patients hospitalized with covid Sked Social 0 0 0 0 1485719054551855120 0 0 0 0 0 0 0 0 0 0 0 0 0 [] 1 0 0 0 [] 0 172.0 200.0
      1 t1466853630691377152 50338306 1638560133 On this visit to St. Louis, I wanted to focus on Union Station, a magnificent property located in the heart of the city which features an aquarium, numerous dining options, a Ferris wheel and, of course, the Union Station Hotel.  https://t.co/HlDeMqA2n8 on this visit to st louis i wanted to focus on union station a magnificent property located in the heart of the city which features an aquarium numerous dining options a ferris wheel and of course the union station hotel Sked Social 0 0 0 0 1466853630691377152 0 0 0 0 0 0 0 0 0 0 0 0 0 [] 1 0 0 0 [] 0 220.0 253.0
      2 t1446494364792987682 50338306 1633706105 Alice Dunbar-Nelson was a racially-mixed bisexual poet and author whose career spanned multiple literary genres and culminated during the Harlem Renaissance. She was also a lifelong educator and activist who fought for women’s suffrage and equality  https://t.co/WqsxXTZKqM https://t.co/XQevKfSRyU alice dunbarnelson was a raciallymixed bisexual poet and author whose career spanned multiple literary genres and culminated during the harlem renaissance she was also a lifelong educator and activist who fought for womens suffrage and equality Sked Social 0 0 0 0 1446494364792987682 0 0 0 0 0 0 0 0 0 0 0 0 0 [] 1 0 0 0 [] 0 244.0 297.0
      3 t1486413634813276165 50338306 1643223601 The new year is finally here. We are headed into the third year of a devastating pandemic in a nation that seems more divided than ever. https://t.co/EiWrMvSCPA the new year is finally here we are headed into the third year of a devastating pandemic in a nation that seems more divided than ever Sked Social 0 0 0 0 1486413634813276165 0 0 0 0 0 0 0 0 0 0 0 0 0 [] 1 0 0 0 [] 0 134.0 160.0
      4 t1471189609078050821 106526969 1639593910 The warning letter highlights shortfalls in risk assessment, corrective and preventive action, complaint handling, device recalls and adverse event reporting at the Northridge, California, headquarters of Medtronic's diabetes segment. https://t.co/fMRpEWJPWC the warning letter highlights shortfalls in risk assessment corrective and preventive action complaint handling device recalls and adverse event reporting at the northridge california headquarters of medtronics diabetes segment Twitter Web App 0 0 0 0 1471189609078050821 0 1 0 0 0 0 0 0 0 0 0 0 0 [] 1 0 0 0 [] 0 227.0 258.0
      ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
      426158 t1486308593049780232 1442657001440489480 1643198557 What if you got enough of the small points, that you didn't have to ace the midterm or the final. What's your secret studying hack?\n#beginagainandwin #thinkdifferent #motivation #help #passtheclass #learn #theblackmancan #B1 #youareenough #positivity #timemanagement #repetition�� https://t.co/EnTeJyo7jm what if you got enough of the small points that you didnt have to ace the midterm or the final whats your secret studying hack beginagainandwin thinkdifferent motivation help passtheclass learn theblackmancan b youareenough positivity timemanagement repetition PromoRepublic 0 0 0 0 1486308593049780232 0 1 0 1 0 0 0 0 0 0 0 0 0 [] 0 0 0 12 [{'tagname': 'beginagainandwin'}, {'tagname': 'thinkdifferent'}, {'tagname': 'motivation'}, {'tagname': 'help'}, {'tagname': 'passtheclass'}, {'tagname': 'learn'}, {'tagname': 'theblackmancan'}, {'tagname': 'B1'}, {'tagname': 'youareenough'}, {'tagname': 'positivity'}, {'tagname': 'timemanagement'}, {'tagname': 'repetition'}] 1 260.0 304.0
      426159 t1495030852715118594 1447249447424036866 1645278106 SCOPE OF CLOUD STORAGE \n\n#Mcoin aims to disrupt the cloud storage industry by eliminating the current industry barriers. M-coin network is building the most reliable #decentralized cloud data storage that ensures low costs and immutable security for the user’s data.\n\n#crypto https://t.co/XFshg0SVeY scope of cloud storage mcoin aims to disrupt the cloud storage industry by eliminating the current industry barriers mcoin network is building the most reliable decentralized cloud data storage that ensures low costs and immutable security for the users data crypto Twitter Web App 0 0 0 0 1495030852715118594 0 1 0 1 0 0 0 0 0 0 0 0 0 [] 0 0 0 3 [{'tagname': 'Mcoin'}, {'tagname': 'decentralized'}, {'tagname': 'crypto'}] 1 265.0 299.0
      426160 t1495671246339850240 1468895392699846656 1645430788 Everything in life is somewhere else, and you get there in a car.\n“The road goes on forever and the party never ends.”\nLive your life by a Compass not a clock\n#Rentalcars #Cab #self #driving #longdrive #safe #mydriver #\n��https://t.co/qYwHNQ8eQI\n��8886399949 https://t.co/dryhxD1XJI everything in life is somewhere else and you get there in a car the road goes on forever and the party never ends live your life by a compass not a clock rentalcars cab self driving longdrive safe mydriver Twitter Web App 0 0 0 0 1495671246339850240 0 0 0 0 0 0 0 0 0 0 0 0 0 [] 1 0 0 7 [{'tagname': 'Rentalcars'}, {'tagname': 'Cab'}, {'tagname': 'self'}, {'tagname': 'driving'}, {'tagname': 'longdrive'}, {'tagname': 'safe'}, {'tagname': 'mydriver'}] 1 205.0 282.0
      426161 t1488487987289731074 1469339407345934338 1643718165 Cybersecurity and privacy must be priorities for nonprofits—the risk of ignoring increased cyberthreats is too great. Access this e-book to find out how you can use #Microsoft 365 to build agile security frameworks in your organization. https://t.co/jhMUXYhcNU cybersecurity and privacy must be priorities for nonprofitsthe risk of ignoring increased cyberthreats is too great access this ebook to find out how you can use microsoft to build agile security frameworks in your organization ContentMX 0 0 0 0 1488487987289731074 0 0 0 0 0 0 0 0 0 0 0 0 0 [] 1 0 0 1 [{'tagname': 'Microsoft'}] 0 227.0 260.0
      426162 t1491185930106978304 1477407428950036485 1644361405 OK #100Devs I need webcam recommendations. Doesn't have to be cheap, but I don't want to pay $500 either. \n\nRight now I am looking at the Logitech C920x HD Pro Webcam. Open to all suggestions. ok devs i need webcam recommendations doesnt have to be cheap but i dont want to pay either right now i am looking at the logitech cx hd pro webcam open to all suggestions TweetDeck 0 0 0 0 1491185930106978304 0 0 0 2 0 0 0 0 0 0 0 0 0 [] 0 0 0 1 [{'tagname': '100Devs'}] 1 171.0 195.0

      426163 rows × 33 columns


      Time of day (in minutes, UTC)¶

      [100]:
       
      from datetime import datetime
      
      def convert_unixtime_to_datetime(a):
          return datetime.utcfromtimestamp(a)
      
      def get_time_in_minutes(unix_time):
          dt = convert_unixtime_to_datetime(unix_time)
          return (dt.hour * 60) + dt.minute
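For a concrete check of the minutes-since-midnight conversion, the sketch below uses the `created_at` value of the first row shown earlier (1643058000, i.e. 2022-01-24 21:00:00 UTC):

```python
from datetime import datetime

def get_time_in_minutes(unix_time):
    # Minutes elapsed since midnight UTC on the day of the timestamp.
    dt = datetime.utcfromtimestamp(unix_time)
    return (dt.hour * 60) + dt.minute

# 1643058000 is 2022-01-24 21:00:00 UTC -> 21 * 60 + 0 = 1260 minutes.
minutes = get_time_in_minutes(1643058000)
```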
      [101]:
       
      tweets_df['time_of_creation'] = tweets_df.apply(lambda x:  get_time_in_minutes(x.created_at), axis=1).astype(float)

      Add days difference between consecutive tweets, days_since_prev_tweet¶

      [102]:
       
      tweets_df['creation_date'] = tweets_df.apply(lambda x: convert_unixtime_to_datetime(x.created_at), axis=1)
      tweets_df.sort_values(by=['author_id', 'creation_date'], inplace=True)
      [103]:
       
      grouped = tweets_df.groupby('author_id')
      tweets_df['days_since_prev_tweet'] = grouped['creation_date'].diff().dt.days
      tweets_df['days_since_prev_tweet'].fillna(0, inplace=True)
      tweets_df = tweets_df.drop(['creation_date'], axis=1)
      [104]:
       
      # Revert to original order 
      tweets_df.sort_index(inplace=True)
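The per-author gap computation above can be illustrated on a hypothetical mini-frame (authors and dates invented): sort by author and date, take per-group consecutive differences in whole days, and fill each author's first tweet with 0.

```python
import pandas as pd

# Hypothetical two-author frame with already-parsed datetimes.
df = pd.DataFrame({
    'author_id': ['a', 'a', 'a', 'b'],
    'creation_date': pd.to_datetime(['2022-01-01', '2022-01-03',
                                     '2022-01-08', '2022-02-01']),
})
df = df.sort_values(['author_id', 'creation_date'])
# diff() within each author gives NaT for the first tweet; .dt.days
# turns the timedeltas into whole days, and fillna(0) handles the NaT.
df['days_since_prev_tweet'] = df.groupby('author_id')['creation_date'].diff().dt.days
df['days_since_prev_tweet'] = df['days_since_prev_tweet'].fillna(0)
```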

      Remove some leftover special characters¶

      [105]:
       
      tweets_df['text'] = tweets_df['text'].str.replace('|', '', regex=False)
      [106]:
       
      tweets_df.columns
      [106]:
      Index(['id', 'author_id', 'created_at', 'org_text', 'text', 'source',
             'withheld', 'copyright_infringement', 'is_reply', 'geo_tagged',
             'conversation_id', 'reply_settings', 'retweet_count', 'reply_count',
             'like_count', 'quote_count', 'any_polls_attached', 'any_media_attached',
             'possibly_sensitive', 'has_referenced_tweets', 'media_attached',
             'no_cashtags', 'no_mentions', 'no_user_mentions', 'user_mentions',
             'no_urls', 'contains_images', 'contains_annotations', 'no_hashtags',
             'hashtags', 'author_label', 'cleaned_tweet_len', 'org_tweet_len',
             'time_of_creation', 'days_since_prev_tweet'],
            dtype='object')
      [107]:
       
      add_tweets_feature_shp_values = ['is_reply', 'time_of_creation', 'no_urls',
                                       'no_hashtags',  'org_tweet_len', 'no_mentions', 
                                       'any_media_attached', 'contains_annotations', 
                                       'has_referenced_tweets', 'possibly_sensitive', 
                                       'no_user_mentions']
      [108]:
       
      col_to_leave =  ['id', 'author_id', 'text', 'days_since_prev_tweet', 'created_at'] + add_tweets_feature_shp_values
      tweets_df = tweets_df[col_to_leave]
      [109]:
       
      tweets_df.columns
      [109]:
      Index(['id', 'author_id', 'text', 'days_since_prev_tweet', 'created_at',
             'is_reply', 'time_of_creation', 'no_urls', 'no_hashtags',
             'org_tweet_len', 'no_mentions', 'any_media_attached',
             'contains_annotations', 'has_referenced_tweets', 'possibly_sensitive',
             'no_user_mentions'],
            dtype='object')

      Split tweets data into training, validation, and test sets based on the users data split¶

      [110]:
       
      users_train_set_users_id = train_users_data['id']
      users_val_set_users_id = val_users_data['id']
      users_test_set_users_id = test_users_data['id']
      [111]:
       
      train_tweets_data = tweets_df[tweets_df['author_id'].isin(users_train_set_users_id)]
      val_tweets_data = tweets_df[tweets_df['author_id'].isin(users_val_set_users_id)]
      test_tweets_data = tweets_df[tweets_df['author_id'].isin(users_test_set_users_id)]
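Splitting tweets by the users split (rather than splitting tweets directly) keeps all tweets of one author inside a single split, which avoids author-level leakage between train, validation, and test. A toy sketch with invented ids:

```python
import pandas as pd

# Hypothetical tweets and a user-level split into train/val/test ids.
tweets = pd.DataFrame({'author_id': [1, 1, 2, 3, 3, 3],
                       'text': ['a', 'b', 'c', 'd', 'e', 'f']})
train_ids, val_ids, test_ids = pd.Series([1]), pd.Series([2]), pd.Series([3])

# Route each tweet to the split its author was assigned to.
train = tweets[tweets['author_id'].isin(train_ids)]
val = tweets[tweets['author_id'].isin(val_ids)]
test = tweets[tweets['author_id'].isin(test_ids)]
# Each author's tweets end up in exactly one split.
```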
      [112]:
       
      train_tweets_data1 = pd.DataFrame(train_tweets_data).copy()
      val_tweets_data1 = pd.DataFrame(val_tweets_data).copy()
      test_tweets_data1 = pd.DataFrame(test_tweets_data).copy()

      Analysis¶

      [113]:
       
      train_tweets_data.columns
      [113]:
      Index(['id', 'author_id', 'text', 'days_since_prev_tweet', 'created_at',
             'is_reply', 'time_of_creation', 'no_urls', 'no_hashtags',
             'org_tweet_len', 'no_mentions', 'any_media_attached',
             'contains_annotations', 'has_referenced_tweets', 'possibly_sensitive',
             'no_user_mentions'],
            dtype='object')
      [114]:
       
      columns_to_standardize = [  #'id', 
                                  # 'author_id', 
                                  # 'created_at',
                                  # 'text', 
                                  'days_since_prev_tweet', 
                                  'is_reply',
                                  'time_of_creation',
                                  'no_urls',
                                  'no_hashtags',
                                  'org_tweet_len',
                                  'no_mentions', 
                                  'any_media_attached', 
                                  'contains_annotations',
                                  'has_referenced_tweets', 
                                  'possibly_sensitive', 
                                  'no_user_mentions']

      Remove unnecessary columns¶

      [115]:
       
      train_tweets_data.columns
      [115]:
      Index(['id', 'author_id', 'text', 'days_since_prev_tweet', 'created_at',
             'is_reply', 'time_of_creation', 'no_urls', 'no_hashtags',
             'org_tweet_len', 'no_mentions', 'any_media_attached',
             'contains_annotations', 'has_referenced_tweets', 'possibly_sensitive',
             'no_user_mentions'],
            dtype='object')
      [116]:
       
      unnecessery_col_to_remove = [# 'id',
                                   #'author_id',    # needed later
                                   #'created_at',   # needed to sort tweets per user later
                                   'org_text',
                                   'source',
                                   'conversation_id',
                                   'user_mentions',
                                   'hashtags',
                                   'created_at_datetime']
      [117]:
       
      train_tweets_data = drop_columns(train_tweets_data, unnecessery_col_to_remove)
      val_tweets_data = drop_columns(val_tweets_data, unnecessery_col_to_remove)
      test_tweets_data = drop_columns(test_tweets_data, unnecessery_col_to_remove)
      [118]:
       
      columns_to_standardize = [ col for col in columns_to_standardize if col in train_tweets_data.columns]

      Data type conversion (to float)¶

      [119]:
       
      for column_name in columns_to_standardize:
          train_tweets_data[column_name] = train_tweets_data[column_name].astype(float)
          val_tweets_data[column_name] = val_tweets_data[column_name].astype(float)
          test_tweets_data[column_name] = test_tweets_data[column_name].astype(float)

      Standardize tweets data using the training set's per-column mean and standard deviation¶

      [120]:
       
      def standardize_column(df, col_name, mean_training, std_training):
          df_cp = df.copy()
          df_cp[col_name] = (df[col_name] - mean_training) / std_training
      ​
          return df_cp
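To see the function's effect on a toy column (values invented): with training mean 2.0 and sample std 1.0 (pandas uses ddof=1 by default), the standardized values are the familiar z-scores.

```python
import pandas as pd

def standardize_column(df, col_name, mean_training, std_training):
    # z-score using statistics from the training split only, so the
    # validation/test splits are scaled without leaking their own stats.
    df_cp = df.copy()
    df_cp[col_name] = (df[col_name] - mean_training) / std_training
    return df_cp

train = pd.DataFrame({'x': [1.0, 2.0, 3.0]})
mean_x, std_x = train['x'].mean(), train['x'].std()  # 2.0 and 1.0
train_std = standardize_column(train, 'x', mean_x, std_x)
# -> the 'x' column becomes [-1.0, 0.0, 1.0]
```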

      Standardize¶

      [121]:
       
      for column_name in columns_to_standardize:
          mean_training = train_tweets_data[column_name].mean()
          std_training = train_tweets_data[column_name].std()
          print(column_name)
          print("mean_training = ", mean_training)
          print("std_training = ", std_training)
          print()
          
          train_tweets_data = standardize_column(train_tweets_data, column_name, mean_training, std_training)
          val_tweets_data = standardize_column(val_tweets_data, column_name, mean_training, std_training)
          test_tweets_data = standardize_column(test_tweets_data, column_name, mean_training, std_training)
      days_since_prev_tweet
      mean_training =  10.150619688444115
      std_training =  83.86048614893724
      
      is_reply
      mean_training =  0.15228149709017305
      std_training =  0.35929412990954285
      
      time_of_creation
      mean_training =  832.0637669213666
      std_training =  379.59396560713145
      
      no_urls
      mean_training =  0.5035845212495471
      std_training =  0.5677704754324685
      
      no_hashtags
      mean_training =  1.8141632627286233
      std_training =  3.3132555452757417
      
      org_tweet_len
      mean_training =  171.79513887734856
      std_training =  76.81292827031992
      
      no_mentions
      mean_training =  0.011390036460081694
      std_training =  0.19828870716436514
      
      any_media_attached
      mean_training =  0.005327758519262024
      std_training =  0.07279691697847118
      
      contains_annotations
      mean_training =  0.006201869867088545
      std_training =  0.07850749748981518
      
      has_referenced_tweets
      mean_training =  0.0032837338846106547
      std_training =  0.0572098055796236
      
      possibly_sensitive
      mean_training =  0.004453647171435504
      std_training =  0.06658698772772154
      
      no_user_mentions
      mean_training =  0.9017738145488023
      std_training =  1.3509284665873897
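The loop above fits the scaler on the training split only and reuses those statistics for the validation and test splits, which prevents data leakage. A minimal pure-Python sketch of the same idea (function names are illustrative, not from the notebook):

```python
def fit_standardizer(values):
    """Compute mean/std from the training split only (ddof=1, matching pandas .std())."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var ** 0.5

def apply_standardizer(values, mean, std):
    """Transform any split using the *training* statistics, never its own."""
    return [(v - mean) / std for v in values]

train = [2.0, 4.0, 6.0]
mean, std = fit_standardizer(train)           # mean = 4.0, std = 2.0
scaled = apply_standardizer(train, mean, std)  # [-1.0, 0.0, 1.0]
```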
      
      

      Text preprocessing¶


      Create backup column for text before processing¶

      [122]:
       
      train_tweets_data.loc[:, 'text_np'] = train_tweets_data['text']
      [123]:
       
      val_tweets_data.loc[:, 'text_np'] = val_tweets_data['text']
      [124]:
       
      test_tweets_data.loc[:, 'text_np'] = test_tweets_data['text']

Tokenize tweet text, remove stop words, and apply word embeddings¶

      [125]:
       
      from tensorflow.keras.preprocessing.text import Tokenizer
      from tensorflow.keras.preprocessing.sequence import pad_sequences
      [126]:
       
      !pip install nltk
      Collecting nltk
        Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
      Requirement already satisfied: click in /opt/conda/lib/python3.7/site-packages (from nltk) (8.1.6)
      Requirement already satisfied: joblib in /opt/conda/lib/python3.7/site-packages (from nltk) (1.3.1)
      Collecting regex>=2021.8.3 (from nltk)
        Obtaining dependency information for regex>=2021.8.3 from https://files.pythonhosted.org/packages/63/78/ed291d95116695b8b5d7469a931d7c2e83d942df0853915ee504cee98bcf/regex-2023.8.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
        Using cached regex-2023.8.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
      Requirement already satisfied: tqdm in /opt/conda/lib/python3.7/site-packages (from nltk) (4.63.0)
      Requirement already satisfied: importlib-metadata in /opt/conda/lib/python3.7/site-packages (from click->nltk) (4.11.4)
      Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->click->nltk) (3.15.0)
      Requirement already satisfied: typing-extensions>=3.6.4 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->click->nltk) (4.7.1)
      Using cached regex-2023.8.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (758 kB)
      Installing collected packages: regex, nltk
      Successfully installed nltk-3.8.1 regex-2023.8.8
      
      [127]:
       
      import nltk
      from nltk.tokenize import word_tokenize
      from nltk.corpus import stopwords
      nltk.download('stopwords')
      [nltk_data] Downloading package stopwords to
      [nltk_data]     /home/jupyter/nltk_data...
      [nltk_data]   Package stopwords is already up-to-date!
      
      [127]:
      True
      [128]:
       
      nltk.download('punkt')
      [nltk_data] Downloading package punkt to /home/jupyter/nltk_data...
      [nltk_data]   Package punkt is already up-to-date!
      
      [128]:
      True
      [129]:
       
      train_tweets_data.loc[:, 'text_tk'] = train_tweets_data['text'].apply(lambda text : word_tokenize(text))
      val_tweets_data.loc[:, 'text_tk'] = val_tweets_data['text'].apply(lambda text : word_tokenize(text))
      test_tweets_data.loc[:, 'text_tk'] = test_tweets_data['text'].apply(lambda text : word_tokenize(text))

      Remove stopwords¶

      [130]:
       
      stop_words = stopwords.words('english')
      [131]:
       
      train_tweets_data.loc[:, 'text_tk'] = train_tweets_data['text_tk'].apply(lambda words: [word for word in words if word not in stop_words])
      val_tweets_data.loc[:, 'text_tk'] = val_tweets_data['text_tk'].apply(lambda words: [word for word in words if word not in stop_words])
      test_tweets_data.loc[:, 'text_tk'] = test_tweets_data['text_tk'].apply(lambda words: [word for word in words if word not in stop_words])
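One caveat with the filter above: NLTK's English stopword list is lowercase, so capitalized tokens such as "The" pass through unless tokens are lowercased before the comparison. A minimal sketch of a case-insensitive filter (the tiny stopword set here is illustrative, not NLTK's full list):

```python
# Tiny illustrative stopword set -- the real list comes from stopwords.words('english')
stop_words = {"the", "is", "a", "of"}

def remove_stopwords(tokens, stop_words):
    # Compare case-insensitively so "The" is filtered like "the"
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["The", "model", "is", "a", "baseline"]
filtered = remove_stopwords(tokens, stop_words)  # ['model', 'baseline']
```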

Rejoin tokens into text without extra spaces¶

      [132]:
       
      train_tweets_data.loc[:, 'text'] = train_tweets_data['text_tk'].apply(lambda words : ' '.join(words))
      val_tweets_data.loc[:, 'text'] = val_tweets_data['text_tk'].apply(lambda words : ' '.join(words))
      test_tweets_data.loc[:, 'text'] = test_tweets_data['text_tk'].apply(lambda words : ' '.join(words))

      Word Embedding¶

      [133]:
       
      # !wget http://nlp.stanford.edu/data/glove.6B.zip
      # !unzip glove.6B.zip

      For training set¶

      [134]:
       
      tokenizer = Tokenizer()
      tokenizer.fit_on_texts(train_tweets_data['text'])
      word_index = tokenizer.word_index
      ​
      num_words = len(word_index) + 1  # adding 1 for padding token
      embedding_dim = 100              # using GloVe 100-dimensional vectors

      Integer encode text¶

      [135]:
       
      train_tweets_data.loc[:, 'text_seq'] = train_tweets_data['text'].apply(lambda text:  tokenizer.texts_to_sequences([text])[0])
      val_tweets_data.loc[:, 'text_seq'] = val_tweets_data['text'].apply(lambda text: tokenizer.texts_to_sequences([text])[0])
      test_tweets_data.loc[:, 'text_seq'] = test_tweets_data['text'].apply(lambda text: tokenizer.texts_to_sequences([text])[0])

      Pad encoded text to a max length¶

      [136]:
       
      max_length_train = train_tweets_data['text_seq'].apply(len).max()
      max_length_val = val_tweets_data['text_seq'].apply(len).max()
      max_length_test = test_tweets_data['text_seq'].apply(len).max()
      [137]:
       
      max_length_train
      [137]:
      76
      [138]:
       
      max_length_val
      [138]:
      37
      [139]:
       
      max_length_test
      [139]:
      36
      [140]:
       
      train_tweets_data['text_seq'].apply(len).mean()
      [140]:
      14.39626491888712
      [141]:
       
      val_tweets_data['text_seq'].apply(len).mean()
      [141]:
      12.773751671504517
      [142]:
       
      test_tweets_data['text_seq'].apply(len).mean()
      [142]:
      12.53067008570079
      [143]:
       
      max_length = 15  # max_length_train
      [144]:
       
      train_tweets_data.loc[:, 'text_seq_ps'] = train_tweets_data['text_seq'].apply(lambda encoded: pad_sequences([encoded], maxlen=max_length, padding='post'))
      val_tweets_data.loc[:, 'text_seq_ps'] = val_tweets_data['text_seq'].apply(lambda encoded: pad_sequences([encoded], maxlen=max_length, padding='post'))
      test_tweets_data.loc[:, 'text_seq_ps'] = test_tweets_data['text_seq'].apply(lambda encoded: pad_sequences([encoded], maxlen=max_length, padding='post'))
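The `Tokenizer` maps each word to an integer index, and `pad_sequences(..., padding='post')` truncates long sequences and right-pads short ones with zeros. A simplified pure-Python sketch of that encode-and-pad step (the index assignment order is illustrative; Keras orders indices by word frequency):

```python
def build_word_index(texts):
    # Keras' Tokenizer assigns indices starting at 1 (0 is reserved for padding);
    # this sketch simply assigns them in order of first appearance.
    index = {}
    for text in texts:
        for word in text.split():
            if word not in index:
                index[word] = len(index) + 1
    return index

def encode_and_pad(text, index, max_length):
    seq = [index[w] for w in text.split() if w in index]  # unseen words are dropped
    seq = seq[:max_length]                                # truncate ...
    return seq + [0] * (max_length - len(seq))            # ... then pad at the end ('post')

index = build_word_index(["good morning world", "good night"])
encoded = encode_and_pad("good night world", index, max_length=5)  # [1, 4, 3, 0, 0]
```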

      Load GloVe embedding¶

      [145]:
       
      embeddings_index = {}
      with open('glove.6B.100d.txt', encoding='utf-8') as f:
          for line in f:
              values = line.split()
              word = values[0]
              coefs = np.asarray(values[1:], dtype='float32')
              embeddings_index[word] = coefs
      [146]:
       
      print('Loaded %s word vectors.' % len(embeddings_index))
      Loaded 400000 word vectors.
      

      Creating a weight matrix for words¶

      [147]:
       
      embedding_matrix = np.zeros((num_words, embedding_dim))
      ​
      for word, i in word_index.items():
          embedding_vector = embeddings_index.get(word)
          if embedding_vector is not None:
              embedding_matrix[i] = embedding_vector
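Words missing from GloVe keep an all-zero row in `embedding_matrix`, so it can be worth checking how much of the tokenizer vocabulary actually received a pretrained vector. A hedged sketch of such a coverage check (the `word_index` and `embeddings_index` below are toy stand-ins for the real objects built above):

```python
import numpy as np

def embedding_coverage(word_index, embeddings_index):
    """Fraction of the tokenizer vocabulary that received a pretrained vector."""
    hits = sum(1 for word in word_index if word in embeddings_index)
    return hits / len(word_index)

# Toy stand-ins for the tokenizer vocabulary and the GloVe lookup
word_index = {"good": 1, "morning": 2, "zzzxq": 3}
embeddings_index = {"good": np.ones(100, dtype="float32"),
                    "morning": np.ones(100, dtype="float32")}
coverage = embedding_coverage(word_index, embeddings_index)  # 2/3 of the toy vocab is covered
```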

      Correlation of numeric tweets data¶

      [148]:
       
      sns.set(font_scale=1.5)
      [149]:
       
      corr_threshold = 0.52
      corr = train_tweets_data[columns_to_standardize].corr()
      lower_tri = corr.where(np.tril(np.ones(corr.shape),k=-1).astype(bool)) #creating lower triangular correlation matrix
      f = plt.figure(figsize=(20, 15))
      sns.heatmap(lower_tri, cmap="PiYG", annot=True, vmin=-1, vmax=1, ax=plt.gca()) #, annot_kws={"fontsize": 16})
      high_corr = []
      for column in train_tweets_data[columns_to_standardize]:
          for col in train_tweets_data[columns_to_standardize]:
              if abs(lower_tri[column][col]) > corr_threshold:
                  high_corr.append((column, col, lower_tri[column][col]))
      high_corr = sorted(high_corr, key=lambda x: x[2], reverse=True)
      [150]:
       
      sns.set(font_scale=1)
      [151]:
       
      print("Number of columns containing high correlation:", len(set([x[0] for x in high_corr])))
      high_corr
      Number of columns containing high correlation: 0
      
      [151]:
      []
      [152]:
       
      # train_tweets_data = train_tweets_data.drop(['cleaned_tweet_len', 'quote_count'], axis=1)
      [153]:
       
      # val_tweets_data = val_tweets_data.drop(['cleaned_tweet_len', 'quote_count'], axis=1)
      [154]:
       
      # test_tweets_data = test_tweets_data.drop(['cleaned_tweet_len', 'quote_count'], axis=1)

      Split tweets data for input and output¶


      And convert inputs to tensors¶

      [155]:
       
      train_tweets_data.columns
      [155]:
      Index(['id', 'author_id', 'text', 'days_since_prev_tweet', 'created_at',
             'is_reply', 'time_of_creation', 'no_urls', 'no_hashtags',
             'org_tweet_len', 'no_mentions', 'any_media_attached',
             'contains_annotations', 'has_referenced_tweets', 'possibly_sensitive',
             'no_user_mentions', 'text_np', 'text_tk', 'text_seq', 'text_seq_ps'],
            dtype='object')
      [156]:
       
      compact_train_tweets_text_data = []
      compact_train_tweets_add_feat_data = []
      for author_id, group in train_tweets_data.groupby('author_id'):
          group = group.sort_values('created_at', ascending=False)
          author_tweets_text = []
          author_tweets_add_feat = []
          for index, row in group.iterrows():
              row_arr = []
              row_arr.append(row['days_since_prev_tweet'])
              row_arr.append(row['is_reply'])
              row_arr.append(row['time_of_creation'])
              row_arr.append(row['no_urls'])
              row_arr.append(row['no_hashtags'])
              row_arr.append(row['org_tweet_len'])
              row_arr.append(row['no_mentions'])
              row_arr.append(row['any_media_attached'])
              row_arr.append(row['contains_annotations'])
              row_arr.append(row['has_referenced_tweets'])
              row_arr.append(row['possibly_sensitive'])
              row_arr.append(row['no_user_mentions'])
              author_tweets_add_feat.append(row_arr)
              author_tweets_text.append(row['text_seq_ps'][0])
          compact_train_tweets_text_data.append(author_tweets_text)
          compact_train_tweets_add_feat_data.append(author_tweets_add_feat)
          
      compact_train_tweets_text_data = np.array(compact_train_tweets_text_data)
      compact_train_tweets_add_feat_data = np.array(compact_train_tweets_add_feat_data)
      /opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:26: VisibleDeprecationWarning:
      
      Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
      
      /opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:27: VisibleDeprecationWarning:
      
      Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
      
      
      [157]:
       
      compact_val_tweets_text_data = []
      compact_val_tweets_add_feat_data = []
      for author_id, group in val_tweets_data.groupby('author_id'):
          group = group.sort_values('created_at', ascending=False)
          author_tweets_text = []
          author_tweets_add_feat = []
          for index, row in group.iterrows():
              row_arr = []
              row_arr.append(row['days_since_prev_tweet'])
              row_arr.append(row['is_reply'])
              row_arr.append(row['time_of_creation'])
              row_arr.append(row['no_urls'])
              row_arr.append(row['no_hashtags'])
              row_arr.append(row['org_tweet_len'])
              row_arr.append(row['no_mentions'])
              row_arr.append(row['any_media_attached'])
              row_arr.append(row['contains_annotations'])
              row_arr.append(row['has_referenced_tweets'])
              row_arr.append(row['possibly_sensitive'])
              row_arr.append(row['no_user_mentions'])
              author_tweets_add_feat.append(row_arr)
              author_tweets_text.append(row['text_seq_ps'][0])
          compact_val_tweets_text_data.append(author_tweets_text)
          compact_val_tweets_add_feat_data.append(author_tweets_add_feat)
          
      compact_val_tweets_text_data = np.array(compact_val_tweets_text_data)
      compact_val_tweets_add_feat_data = np.array(compact_val_tweets_add_feat_data)
      /opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:26: VisibleDeprecationWarning:
      
      Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
      
      /opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:27: VisibleDeprecationWarning:
      
      Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
      
      
      [158]:
       
      compact_test_tweets_text_data = []
      compact_test_tweets_add_feat_data = []
      for author_id, group in test_tweets_data.groupby('author_id'):
          group = group.sort_values('created_at', ascending=False)
          author_tweets_text = []
          author_tweets_add_feat = []
          for index, row in group.iterrows():
              row_arr = []
              row_arr.append(row['days_since_prev_tweet'])
              row_arr.append(row['is_reply'])
              row_arr.append(row['time_of_creation'])
              row_arr.append(row['no_urls'])
              row_arr.append(row['no_hashtags'])
              row_arr.append(row['org_tweet_len'])
              row_arr.append(row['no_mentions'])
              row_arr.append(row['any_media_attached'])
              row_arr.append(row['contains_annotations'])
              row_arr.append(row['has_referenced_tweets'])
              row_arr.append(row['possibly_sensitive'])
              row_arr.append(row['no_user_mentions'])
              author_tweets_add_feat.append(row_arr)
              author_tweets_text.append(row['text_seq_ps'][0])
          compact_test_tweets_text_data.append(author_tweets_text)
          compact_test_tweets_add_feat_data.append(author_tweets_add_feat)
          
      compact_test_tweets_text_data = np.array(compact_test_tweets_text_data)
      compact_test_tweets_add_feat_data = np.array(compact_test_tweets_add_feat_data)
      /opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:26: VisibleDeprecationWarning:
      
      Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
      
      /opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:27: VisibleDeprecationWarning:
      
      Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
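The three cells above duplicate the same per-author grouping logic. A hedged sketch of one reusable helper that could be called once per split (the function name is illustrative, and `FEATURE_COLS` is shown with a reduced column list; the notebook uses all twelve feature columns). Passing `dtype=object` explicitly also silences the `VisibleDeprecationWarning` about ragged sequences:

```python
import numpy as np
import pandas as pd

# Illustrative subset of the feature columns used in the cells above
FEATURE_COLS = ['days_since_prev_tweet', 'is_reply', 'time_of_creation']

def compact_by_author(df, feature_cols=FEATURE_COLS):
    """Group tweets per author (newest first); collect text sequences and feature rows."""
    texts, feats = [], []
    for _, group in df.groupby('author_id'):
        group = group.sort_values('created_at', ascending=False)
        texts.append([row[0] for row in group['text_seq_ps']])
        feats.append(group[feature_cols].values.tolist())
    # dtype=object because authors have different tweet counts (ragged arrays)
    return np.array(texts, dtype=object), np.array(feats, dtype=object)

# Tiny example frame with two authors (one has two tweets, one has a single tweet)
df = pd.DataFrame({
    'author_id': [1, 1, 2],
    'created_at': [2, 1, 3],
    'days_since_prev_tweet': [0.1, 0.2, 0.3],
    'is_reply': [0, 1, 0],
    'time_of_creation': [10, 20, 30],
    'text_seq_ps': [np.array([[5, 6]]), np.array([[7, 8]]), np.array([[9, 0]])],
})
texts, feats = compact_by_author(df)
```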
      
      
      [159]:
      # train_tweets_add_feat_data_X = tf.convert_to_tensor(train_tweets_data.drop(['id', 'author_id', 'text', 'text_np', 'text_tk', 'text_seq', 'text_seq_ps'], axis=1).values, dtype=tf.float32)
      # train_tweets_text_data_X = train_tweets_data['text_seq_ps'].apply(lambda x: x[0])
      # train_tweets_text_data_X_tensor = tf.convert_to_tensor(train_tweets_text_data_X.tolist(), dtype=tf.float32)
      ​
      # val_tweets_add_feat_data_X = tf.convert_to_tensor(val_tweets_data.drop(['id', 'author_id', 'text', 'text_np', 'text_tk', 'text_seq', 'text_seq_ps'], axis=1).values, dtype=tf.float32)
      # val_tweets_text_data_X = val_tweets_data['text_seq_ps'].apply(lambda x: x[0])
      # val_tweets_text_data_X_tensor = tf.convert_to_tensor(val_tweets_text_data_X.tolist(), dtype=tf.float32)
      ​
      # test_tweets_add_feat_data_X = tf.convert_to_tensor(test_tweets_data.drop(['id', 'author_id', 'text', 'text_np', 'text_tk', 'text_seq', 'text_seq_ps'], axis=1).values, dtype=tf.float32)
      # test_tweets_text_data_X = test_tweets_data['text_seq_ps'].apply(lambda x: x[0])
      # test_tweets_text_data_X_tensor = tf.convert_to_tensor(test_tweets_text_data_X.tolist(), dtype=tf.float32)

      Reformat tweets data¶

      [160]:
       
      compact_train_tweets_text_data.shape
      [160]:
      (6946,)
      [161]:
       
      np.array(compact_train_tweets_text_data[0]).shape
      [161]:
      (66, 15)
      [162]:
       
      max_l = 0
      for arr_user_tweets_feat in compact_train_tweets_text_data:
          curr = np.array(arr_user_tweets_feat).shape[0]
          if curr > max_l:
              max_l = curr
      max_l
      [162]:
      3397
      [163]:
       
      max_user_tweets_num = 20 # max_l
      [164]:
       
      compact_train_tweets_text_data_padded = pad_sequences(compact_train_tweets_text_data, maxlen=max_user_tweets_num, padding='post', truncating='post', dtype='float32')
      compact_val_tweets_text_data_padded = pad_sequences(compact_val_tweets_text_data, maxlen=max_user_tweets_num, padding='post', truncating='post', dtype='float32')
      compact_test_tweets_text_data_padded = pad_sequences(compact_test_tweets_text_data, maxlen=max_user_tweets_num, padding='post', truncating='post', dtype='float32')
      len(compact_train_tweets_text_data_padded)
      [164]:
      6946
      [165]:
       
      compact_train_tweets_add_feat_data_padded = pad_sequences(compact_train_tweets_add_feat_data, maxlen=max_user_tweets_num, padding='post', truncating='post', dtype='float32')
      compact_val_tweets_add_feat_data_padded = pad_sequences(compact_val_tweets_add_feat_data, maxlen=max_user_tweets_num, padding='post', truncating='post', dtype='float32')
      compact_test_tweets_add_feat_data_padded = pad_sequences(compact_test_tweets_add_feat_data, maxlen=max_user_tweets_num, padding='post', truncating='post', dtype='float32')
      len(compact_train_tweets_add_feat_data_padded)
      [165]:
      6946
      [166]:
       
      train_users_data_Y.shape
      [166]:
      (6946, 1)
      [167]:
       
      np.array(train_users_data_X).shape
      [167]:
      (6946, 6)
      [168]:
       
      np.array(compact_train_tweets_text_data_padded[0]).shape
      [168]:
      (20, 15)

      DNN models¶


      Function to load a saved neural network model¶

      [169]:
       
      from keras.models import load_model
      ​
      def load_model_from_file(filepath):
        model = load_model(filepath)
        return model

Summarize metrics comparing ground truth with the network's predictions¶

      [170]:
       
      def get_model_metrics(test_Y, out_Y):
        accuracy = accuracy_score(test_Y, out_Y)
        print('Accuracy: {}'.format(accuracy))
        # precision tp / (tp + fp)
        precision = precision_score(test_Y, out_Y, average=None)
        print('Precision: {}'.format(precision))
        # recall: tp / (tp + fn)
        recall = recall_score(test_Y, out_Y)
        print('Recall: {}'.format(recall))
        # f1: 2 tp / (2 tp + fp + fn)
        f1 = f1_score(test_Y, out_Y)
        print('F1 score: %f' % f1)
        # ROC AUC
        auc = roc_auc_score(test_Y, out_Y)
        print('ROC AUC: %f' % auc)
        return (accuracy, precision, recall, f1, auc)

      Creating a confusion matrix¶

      [171]:
       
      def create_confusion_matrix(test_Y, out_Y):
          cm = sklearn.metrics.confusion_matrix(test_Y, out_Y)
      ​
          group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
          group_percentages = ["{0:.2%}".format(value) for value in cm.flatten()/np.sum(cm)]
          labels = [f"{v1}\n\n{v2}" for v1, v2 in zip(group_counts,group_percentages)]
          labels = np.asarray(labels).reshape(2,2)
      ​
    fig = plt.figure(figsize=(4,4))  # single figure; a bare plt.figure() before this line would emit an extra empty one
          ax = fig.add_subplot(111)
      ​
          sns.heatmap(
              cm,
              annot=labels,
              annot_kws={"size": 12},
              fmt='',
              cmap=plt.cm.Blues,
              cbar=False
          )
          ax.set_title("Confusion matrix", fontsize=14)
          ax.set_xticklabels(ax.get_xticklabels(), fontsize=12)
          ax.set_yticklabels(ax.get_yticklabels(), fontsize=12)
          ax.set_ylabel('True', fontsize=12)
          ax.set_xlabel('Predicted', fontsize=12)
          
          fig.show()

      Neural network models¶

      [172]:
       
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Flatten, Input, Concatenate, concatenate, Masking  # Flatten is needed by create_model_0 below
      [173]:
       
      # EarlyStopping
def early_stop(metric='val_accuracy', mode='max', patience=50):
    return EarlyStopping(monitor=metric,  # use the metric argument rather than a hardcoded 'val_accuracy'
                         patience=patience,
                         restore_best_weights=True,
                         mode=mode)
      # PlotLosses
      def plot_losses():
          return PlotLossesCallback()
      ​
      # ModelCheckpoint
      def checkpoint_callback(model_name):
          return ModelCheckpoint(filepath = models_path + '/' + model_name + '.hdf5',
                                  monitor = "val_accuracy",
                                  save_best_only = True,
                                  # save_weights_only = True,
                                  verbose=1)
      [174]:
       
      def train_model(model, model_name, train_X, train_Y, val_X, val_Y, batch_size, epochs, patience=50):
          model.fit(train_X, train_Y, batch_size=batch_size, epochs=epochs,
                      validation_data=(val_X, val_Y),
                      callbacks=[plot_losses(),
                                 early_stop(metric='val_accuracy', mode = 'max', patience=patience),
                                 checkpoint_callback(model_name)])
          return model
      [175]:
       
      def prediction_and_metrics(model, test_X, test_Y):
          out_Y_org = model.predict(test_X, verbose=0)
          out_Y = [0 if x < 0.5 else 1 for x in out_Y_org]
      ​
          x = range(0, len(test_Y))
          fig = plt.figure(figsize=(18, 4))
          colors = ['blue' if val == 0. else 'red' for val in np.asarray(test_Y)]
          plt.scatter(x, out_Y_org, marker='.', label ='predicted', c=colors)
          plt.plot(x, [0.5] * len(test_Y), c='orange')
          plt.ylim((0,1))
      ​
          create_confusion_matrix(test_Y, out_Y)
          get_model_metrics(test_Y, out_Y)

      Model 1. (only additional features of tweets)¶


      Create model¶

      [ ]:
       
      def create_model_1(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          
          masked_input = Masking(mask_value=0.0)(additional_tweet_input)
          
          lstm1 = LSTM(64, return_sequences=True)(masked_input)
          lstm2 = LSTM(64)(lstm1)
          
          dropout = Dropout(0.5)(lstm2)
          output_layer = Dense(1, activation='sigmoid')(dropout)
      ​
          model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
      #     dense_layer1 = Dense(128)(concatenated)
      #     activation_layer1 = Activation('relu')(dense_layer1)
      #     output_layer = Dense(1, activation='sigmoid')(activation_layer1)
          
      #     model = Model(inputs=[text_input, additional_tweet_input], outputs=output_layer)
          
      #     model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model

batch_size=50, epochs=400¶


      Create and train model¶

      [183]:
       
      model_name = 'model_tweets_data_based_10000_1_v1_batch_size_50_20_latest_tweets_of_user_padded_add_feat_only'
      ​
      model = create_model_1(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      [184]:
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_add_feat_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_add_feat_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=50, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.499, max:    0.901, cur:    0.896)
      	validation       	 (min:    0.465, max:    0.509, cur:    0.476)
      Loss
      	training         	 (min:    0.195, max:    0.696, cur:    0.199)
      	validation       	 (min:    0.694, max:    2.743, cur:    2.587)
      
      Epoch 101: val_accuracy did not improve from 0.50942
      139/139 [==============================] - 6s 43ms/step - loss: 0.1986 - accuracy: 0.8963 - val_loss: 2.5874 - val_accuracy: 0.4764
      

      Prediction and results¶

      [185]:
       
      prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
      Accuracy: 0.48013698630136986
      Precision: [0.47966339 0.48058902]
      Recall: 0.4917808219178082
      F1 score: 0.486121
      ROC AUC: 0.480137
      
      <Figure size 640x480 with 0 Axes>

batch_size=100, epochs=400¶


      Create and train model¶

      [186]:
       
      model_name = 'model_tweets_data_based_10000_1_v1_batch_size_100_20_latest_tweets_of_user_padded_add_feat_only'
      ​
      model = create_model_1(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      [187]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_add_feat_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_add_feat_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=100, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.504, max:    0.896, cur:    0.896)
      	validation       	 (min:    0.472, max:    0.513, cur:    0.493)
      Loss
      	training         	 (min:    0.197, max:    0.696, cur:    0.197)
      	validation       	 (min:    0.693, max:    2.659, cur:    2.636)
      
      Epoch 117: val_accuracy did not improve from 0.51279
      70/70 [==============================] - 4s 61ms/step - loss: 0.1965 - accuracy: 0.8958 - val_loss: 2.6361 - val_accuracy: 0.4933
      

      Prediction and results¶

      [188]:
       
      prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
      Accuracy: 0.49383561643835616
      Precision: [0.49454545 0.49291339]
      Recall: 0.42876712328767125
      F1 score: 0.458608
      ROC AUC: 0.493836
      
      <Figure size 640x480 with 0 Axes>

batch_size=250, epochs=400¶


      Create and train model¶

      [189]:
       
      model_name = 'model_tweets_data_based_10000_1_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
      ​
      model = create_model_1(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [190]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_add_feat_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_add_feat_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.498, max:    0.815, cur:    0.812)
      	validation       	 (min:    0.480, max:    0.516, cur:    0.504)
      Loss
      	training         	 (min:    0.336, max:    0.695, cur:    0.336)
      	validation       	 (min:    0.693, max:    1.879, cur:    1.879)
      
      Epoch 103: val_accuracy did not improve from 0.51615
      28/28 [==============================] - 3s 108ms/step - loss: 0.3357 - accuracy: 0.8115 - val_loss: 1.8786 - val_accuracy: 0.5040
      

      Prediction and results¶

      [191]:
       
      prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
      Accuracy: 0.476027397260274
      Precision: [0.46534653 0.48167539]
      Recall: 0.6301369863013698
      F1 score: 0.545994
      ROC AUC: 0.476027
      
      <Figure size 640x480 with 0 Axes>

      Prediction on training subset¶

      [192]:
       
      prediction_and_metrics(model, compact_train_tweets_add_feat_data_padded, train_users_data_Y)
      Accuracy: 0.5185718399078606
      Precision: [0.52686381 0.51419142]
      Recall: 0.6729052692196947
      F1 score: 0.582938
      ROC AUC: 0.518572
      
      <Figure size 640x480 with 0 Axes>

batch_size=500, epochs=400¶


      Create and train model¶

      [193]:
       
      model_name = 'model_tweets_data_based_10000_1_v1_batch_size_500_20_latest_tweets_of_user_padded_add_feat_only'
      ​
      model = create_model_1(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [194]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_add_feat_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_add_feat_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=500, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.499, max:    0.930, cur:    0.920)
      	validation       	 (min:    0.485, max:    0.517, cur:    0.515)
      Loss
      	training         	 (min:    0.119, max:    0.695, cur:    0.148)
      	validation       	 (min:    0.694, max:    4.214, cur:    2.817)
      
      Epoch 339: val_accuracy did not improve from 0.51682
      14/14 [==============================] - 3s 191ms/step - loss: 0.1481 - accuracy: 0.9198 - val_loss: 2.8168 - val_accuracy: 0.5155
      

      Prediction and results¶

      [195]:
       
      prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
      Accuracy: 0.5
      Precision: [0.5 0.5]
      Recall: 0.5178082191780822
      F1 score: 0.508748
      ROC AUC: 0.500000
      

      Model 0. (only additional features of tweets)¶


      Create model¶

      [423]:
       
      def create_model_0(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          
          masked_input = Masking(mask_value=0.0)(additional_tweet_input)
          
          # lstm1 = LSTM(128, return_sequences=True)(masked_input)
          # lstm1_dropout1 = Dropout(0.2)(lstm1)
          # lstm2 = LSTM(64)(lstm1_dropout1)
          
          flatten_layer1 = Flatten()(masked_input)
          
    dense_layer1 = Dense(64, activation='relu')(flatten_layer1)
                
    dropout = Dropout(0.2)(dense_layer1)
          output_layer = Dense(1, activation='sigmoid')(dropout)
      ​
          model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
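A side note on the input format: each user's tweets are zero-padded to a fixed length, and `Masking(mask_value=0.0)` flags the all-zero timesteps. Only mask-aware layers such as `LSTM` actually consume that mask, so in this `Flatten`+`Dense` variant it mainly keeps the input pipeline identical to the LSTM models. A minimal numpy illustration of the padding, with hypothetical shapes (up to 3 tweets, 2 features per tweet):

```python
import numpy as np

# Hypothetical mini-example: 2 users, up to 3 tweets each, 2 features per tweet.
tweets = [np.array([[0.5, 1.2]]),                # user A: 1 tweet
          np.array([[0.1, 0.4], [0.9, 0.3]])]   # user B: 2 tweets
max_len, n_feat = 3, 2
padded = np.zeros((len(tweets), max_len, n_feat))
for i, t in enumerate(tweets):
    padded[i, :len(t)] = t

# Masking(mask_value=0.0) treats a timestep as padding when ALL features are 0.0
mask = ~np.all(padded == 0.0, axis=-1)
print(mask.tolist())  # [[True, False, False], [True, True, False]]
```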

batch_size=250, epochs=400¶


      Create and train model¶

      [424]:
       
      model_name = 'model_tweets_data_based_10000_0_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
      ​
      model = create_model_0(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      [425]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_add_feat_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_add_feat_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.490, max:    0.795, cur:    0.795)
      	validation       	 (min:    0.483, max:    0.535, cur:    0.507)
      Loss
      	training         	 (min:    0.421, max:    0.768, cur:    0.426)
      	validation       	 (min:    0.714, max:    1.137, cur:    0.996)
      
      Epoch 102: val_accuracy did not improve from 0.53499
      28/28 [==============================] - 1s 35ms/step - loss: 0.4258 - accuracy: 0.7946 - val_loss: 0.9959 - val_accuracy: 0.5067
      

      Prediction and results¶

      [426]:
       
      prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
      Accuracy: 0.5198630136986301
      Precision: [0.52056738 0.5192053 ]
      Recall: 0.536986301369863
      F1 score: 0.527946
      ROC AUC: 0.519863
      

Prediction and results on the training set¶

      [427]:
       
      prediction_and_metrics(model, compact_train_tweets_add_feat_data_padded, train_users_data_Y)
      Accuracy: 0.5407428735963145
      Precision: [0.54253081 0.5390992 ]
      Recall: 0.5617621652749784
      F1 score: 0.550197
      ROC AUC: 0.540743
      

      Model 2. (only additional features of tweets)¶


      Create model¶

      [196]:
       
      def create_model_2(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          
          masked_input = Masking(mask_value=0.0)(additional_tweet_input)
          
          lstm1 = LSTM(128, return_sequences=True)(masked_input)
          lstm1_dropout1 = Dropout(0.2)(lstm1)
          lstm2 = LSTM(64)(lstm1_dropout1)
          
          dropout = Dropout(0.2)(lstm2)
          output_layer = Dense(1, activation='sigmoid')(dropout)
      ​
          model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model

batch_size=250, epochs=400¶


      Create and train model¶

      [197]:
       
      model_name = 'model_tweets_data_based_10000_2_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
      ​
      model = create_model_2(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      [198]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_add_feat_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_add_feat_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.500, max:    0.918, cur:    0.915)
      	validation       	 (min:    0.464, max:    0.513, cur:    0.489)
      Loss
      	training         	 (min:    0.148, max:    0.695, cur:    0.148)
      	validation       	 (min:    0.693, max:    2.818, cur:    2.770)
      
      Epoch 122: val_accuracy did not improve from 0.51279
      28/28 [==============================] - 4s 147ms/step - loss: 0.1482 - accuracy: 0.9148 - val_loss: 2.7697 - val_accuracy: 0.4886
      

      Prediction and results¶

      [199]:
       
      prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
      Accuracy: 0.5020547945205479
      Precision: [0.50189633 0.50224215]
      Recall: 0.4602739726027397
      F1 score: 0.480343
      ROC AUC: 0.502055
      

      Model 3. (only additional features of tweets)¶


      Create model¶

      [200]:
       
      def create_model_3(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          
          masked_input = Masking(mask_value=0.0)(additional_tweet_input)
          
          lstm1 = LSTM(128, return_sequences=True)(masked_input)
          lstm1_dropout1 = Dropout(0.2)(lstm1)
          lstm2 = LSTM(64)(lstm1_dropout1)
          lstm2_dropout = Dropout(0.2)(lstm2)
          
          dense_layer1 = Dense(64)(lstm2_dropout)
          dense_layer1_activation_layer1 = Activation('relu')(dense_layer1)
          dense_layer1_dropout1 = Dropout(0.1)(dense_layer1_activation_layer1)
          output_layer = Dense(1, activation='sigmoid')(dense_layer1_dropout1)
      ​
          model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model

batch_size=250, epochs=400¶


      Create and train model¶

      [201]:
       
      model_name = 'model_tweets_data_based_10000_3_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
      ​
      model = create_model_3(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      [202]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_add_feat_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_add_feat_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.500, max:    0.909, cur:    0.908)
      	validation       	 (min:    0.481, max:    0.515, cur:    0.491)
      Loss
      	training         	 (min:    0.173, max:    0.694, cur:    0.173)
      	validation       	 (min:    0.692, max:    2.711, cur:    2.674)
      
      Epoch 109: val_accuracy did not improve from 0.51548
      28/28 [==============================] - 4s 143ms/step - loss: 0.1728 - accuracy: 0.9083 - val_loss: 2.6736 - val_accuracy: 0.4906
      

      Prediction and results¶

      [203]:
       
      prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
      Accuracy: 0.49931506849315066
      Precision: [0.49941107 0.49918167]
      Recall: 0.4178082191780822
      F1 score: 0.454884
      ROC AUC: 0.499315
      

      Model 4. (only additional features of tweets)¶


      Create model¶

      [204]:
       
      def create_model_4(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          
          masked_input = Masking(mask_value=0.0)(additional_tweet_input)
          
          lstm1 = LSTM(64, return_sequences=False)(masked_input)
          lstm1_dropout1 = Dropout(0.2)(lstm1)
          # lstm2 = LSTM(64)(lstm1_dropout1)
          # lstm2_dropout = Dropout(0.2)(lstm2)
          
          dense_layer1 = Dense(64)(lstm1_dropout1)
          dense_layer1_activation_layer1 = Activation('relu')(dense_layer1)
          dense_layer1_dropout1 = Dropout(0.1)(dense_layer1_activation_layer1)
          output_layer = Dense(1, activation='sigmoid')(dense_layer1_dropout1)
      ​
          model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model

batch_size=250, epochs=400¶


      Create and train model¶

      [205]:
       
      model_name = 'model_tweets_data_based_10000_4_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
      ​
      model = create_model_4(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      [206]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_add_feat_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_add_feat_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.500, max:    0.875, cur:    0.868)
      	validation       	 (min:    0.482, max:    0.521, cur:    0.502)
      Loss
      	training         	 (min:    0.239, max:    0.696, cur:    0.258)
      	validation       	 (min:    0.694, max:    2.371, cur:    2.371)
      
      Epoch 153: val_accuracy did not improve from 0.52086
      28/28 [==============================] - 2s 74ms/step - loss: 0.2583 - accuracy: 0.8675 - val_loss: 2.3705 - val_accuracy: 0.5020
      

      Prediction and results¶

      [207]:
       
      prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
      Accuracy: 0.49726027397260275
      Precision: [0.49725275 0.49726776]
      Recall: 0.4986301369863014
      F1 score: 0.497948
      ROC AUC: 0.497260
      

      Model 5. (only additional features of tweets)¶


      Create model¶

      [208]:
       
      def create_model_5(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          
          masked_input = Masking(mask_value=0.0)(additional_tweet_input)
          
          lstm1 = LSTM(64, return_sequences=True)(masked_input)
          lstm2 = LSTM(64)(lstm1)
          
          output_layer = Dense(1, activation='sigmoid')(lstm2)
      ​
          model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model

batch_size=250, epochs=400¶


      Create and train model¶

      [209]:
       
      model_name = 'model_tweets_data_based_10000_5_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
      ​
      model = create_model_5(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      [210]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_add_feat_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_add_feat_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.500, max:    0.900, cur:    0.897)
      	validation       	 (min:    0.469, max:    0.517, cur:    0.489)
      Loss
      	training         	 (min:    0.185, max:    0.694, cur:    0.192)
      	validation       	 (min:    0.693, max:    2.429, cur:    2.429)
      
      Epoch 101: val_accuracy did not improve from 0.51750
      28/28 [==============================] - 3s 110ms/step - loss: 0.1921 - accuracy: 0.8968 - val_loss: 2.4288 - val_accuracy: 0.4892
      

      Prediction and results¶

      [211]:
       
      prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
      Accuracy: 0.46986301369863015
      Precision: [0.46167247 0.4751693 ]
      Recall: 0.5767123287671233
      F1 score: 0.521040
      ROC AUC: 0.469863
      

      Model 6. (only additional features of tweets)¶


      Create model¶

      [212]:
       
      def create_model_6(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          
          masked_input = Masking(mask_value=0.0)(additional_tweet_input)
          
          lstm1 = LSTM(64, return_sequences=True, recurrent_dropout=0.2)(masked_input)
          lstm2 = LSTM(64, recurrent_dropout=0.1)(lstm1)
          
          output_layer = Dense(1, activation='sigmoid')(lstm2)
      ​
          model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model

batch_size=250, epochs=400¶


      Create and train model¶

      [213]:
       
      model_name = 'model_tweets_data_based_10000_6_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
      ​
      model = create_model_6(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      [214]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_add_feat_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_add_feat_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.501, max:    0.926, cur:    0.926)
      	validation       	 (min:    0.491, max:    0.522, cur:    0.499)
      Loss
      	training         	 (min:    0.145, max:    0.695, cur:    0.145)
      	validation       	 (min:    0.694, max:    2.625, cur:    2.625)
      
      Epoch 219: val_accuracy did not improve from 0.52221
      28/28 [==============================] - 3s 100ms/step - loss: 0.1450 - accuracy: 0.9260 - val_loss: 2.6252 - val_accuracy: 0.4993
      

      Prediction and results¶

      [215]:
       
      prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
      Accuracy: 0.5123287671232877
      Precision: [0.51246537 0.51219512]
      Recall: 0.5178082191780822
      F1 score: 0.514986
      ROC AUC: 0.512329
      

      Model 7. (only additional features of tweets)¶


      Create model¶

      [221]:
       
      def create_model_7(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          
          masked_input = Masking(mask_value=0.0)(additional_tweet_input)
          
          lstm1 = LSTM(64, return_sequences=True, activation="relu")(masked_input)
          lstm2 = LSTM(64, activation="relu")(lstm1)
          
          output_layer = Dense(1, activation='sigmoid')(lstm2)
      ​
          model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model

batch_size=250, epochs=400¶


      Create and train model¶

      [217]:
       
      model_name = 'model_tweets_data_based_10000_7_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
      ​
      model = create_model_7(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      [218]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_add_feat_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_add_feat_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.497, max:    0.914, cur:    0.911)
      	validation       	 (min:    0.478, max:    0.522, cur:    0.501)
      Loss
      	training         	 (min:    0.149, max:    0.702, cur:    0.165)
      	validation       	 (min:    0.693, max:    4.006, cur:    3.745)
      
      Epoch 173: val_accuracy did not improve from 0.52221
      28/28 [==============================] - 3s 116ms/step - loss: 0.1645 - accuracy: 0.9109 - val_loss: 3.7446 - val_accuracy: 0.5007
      

      Prediction and results¶

      [219]:
       
      prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
      Accuracy: 0.5061643835616438
      Precision: [0.50576184 0.50662739]
      Recall: 0.4712328767123288
      F1 score: 0.488290
      ROC AUC: 0.506164
      
      [222]:
       
      prediction_and_metrics(model, compact_train_tweets_add_feat_data_padded, train_users_data_Y)
      Accuracy: 0.7333717247336596
      Precision: [0.71527224 0.75479409]
      Recall: 0.6913331413763317
      F1 score: 0.721671
      ROC AUC: 0.733372
      

      Model 7a. (only additional features of tweets)¶


      Create model¶

      [175]:
       
      def create_model_7a(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          
          masked_input = Masking(mask_value=0.0)(additional_tweet_input)
          
          lstm1 = LSTM(64, return_sequences=True, activation="relu")(masked_input)
          lstm2 = LSTM(512, activation="relu")(lstm1)
          
          output_layer = Dense(1, activation='sigmoid')(lstm2)
      ​
          model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
      2023-09-03 16:23:42.684511: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
      2023-09-03 16:23:42.684567: W tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: UNKNOWN ERROR (303)
      2023-09-03 16:23:42.684610: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (b0f306797141): /proc/driver/nvidia/version does not exist
      2023-09-03 16:23:42.686593: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
      To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
      

batch_size=250, epochs=400¶

      xxxxxxxxxx

      Create and train model¶

      [176]:
       
      model_name = 'model_tweets_data_based_10000_7a_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
      ​
      model = create_model_7a(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      [ ]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_add_feat_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_add_feat_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.497, max:    0.966, cur:    0.960)
      	validation       	 (min:    0.483, max:    0.529, cur:    0.513)
      Loss
      	training         	 (min:    0.064, max:    0.715, cur:    0.076)
      	validation       	 (min:    0.693, max:    4.596, cur:    3.335)
      
      Epoch 299: val_accuracy did not improve from 0.52894
      28/28 [==============================] - 11s 400ms/step - loss: 0.0759 - accuracy: 0.9601 - val_loss: 3.3347 - val_accuracy: 0.5135
      Epoch 300/400
       7/28 [======>.......................] - ETA: 8s - loss: 0.0702 - accuracy: 0.9680

      Prediction and results¶

      [ ]:
       
      prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
      [ ]:
       
      prediction_and_metrics(model, compact_train_tweets_add_feat_data_padded, train_users_data_Y)

      Model 7b. (only additional features of tweets)¶


      Create model¶

      [ ]:
       
      def create_model_7b(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          
          masked_input = Masking(mask_value=0.0)(additional_tweet_input)
          
    lstm1 = LSTM(64, return_sequences=True, activation="relu", kernel_regularizer=tf.keras.regularizers.l1(0.01))(masked_input)
    lstm2 = LSTM(128, activation="relu", kernel_regularizer=tf.keras.regularizers.l1(0.01))(lstm1)
          
          output_layer = Dense(1, activation='sigmoid')(lstm2)
      ​
          model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
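The only change from model 7 is the L1 kernel regularizer, which adds 0.01·Σ|w| per kernel to the training loss and pushes weights toward zero. A quick numeric illustration with toy weights (not taken from the model):

```python
import numpy as np

w = np.array([0.5, -0.2, 0.0, 1.3])   # toy kernel weights
penalty = 0.01 * np.abs(w).sum()      # l1(0.01) contribution to the loss
print(round(penalty, 4))  # 0.02
```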

      batch_size=250, epochs=400¶

      [ ]:
       
      model_name = 'model_tweets_data_based_10000_7b_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
      ​
      model = create_model_7b(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      [ ]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_add_feat_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_add_feat_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)

      Prediction and results¶

      [ ]:
       
      prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)

Prediction and results on the training set¶

      [ ]:
       
      prediction_and_metrics(model, compact_train_tweets_add_feat_data_padded, train_users_data_Y)

      Model 8. (only additional features of tweets)¶


      Create model¶

      [223]:
       
      def create_model_8(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          
          masked_input = Masking(mask_value=0.0)(additional_tweet_input)
          
          lstm1 = LSTM(16, return_sequences=True)(masked_input)
          lstm2 = LSTM(16)(lstm1)
          
          output_layer = Dense(1, activation='sigmoid')(lstm2)
      ​
          model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model

batch_size=250, epochs=400¶


      Create and train model¶

      [224]:
       
      model_name = 'model_tweets_data_based_10000_8_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
      ​
      model = create_model_8(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      [225]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_add_feat_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_add_feat_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.491, max:    0.682, cur:    0.678)
      	validation       	 (min:    0.484, max:    0.518, cur:    0.497)
      Loss
      	training         	 (min:    0.563, max:    0.694, cur:    0.563)
      	validation       	 (min:    0.693, max:    0.878, cur:    0.877)
      
      Epoch 109: val_accuracy did not improve from 0.51817
      28/28 [==============================] - 2s 59ms/step - loss: 0.5630 - accuracy: 0.6784 - val_loss: 0.8774 - val_accuracy: 0.4973
      

      Prediction and results¶

      [226]:
       
      prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
      Accuracy: 0.4986301369863014
      Precision: [0.4987715  0.49845201]
      Recall: 0.4410958904109589
      F1 score: 0.468023
      ROC AUC: 0.498630
      
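None of the additional-features models clears roughly 0.52 test accuracy, i.e. they all sit at or near chance on this balanced split. Collecting the test accuracies transcribed from the outputs above (labels follow the section headings; the first block belongs to a model 1 run trained before this excerpt, whose batch size is not shown here):

```python
import pandas as pd

# Test-set accuracies copied from the prediction_and_metrics outputs above.
test_acc = {
    'model 1 (earlier run)': 0.4760, 'model 1 (bs=500)': 0.5000,
    'model 0': 0.5199, 'model 2': 0.5021, 'model 3': 0.4993,
    'model 4': 0.4973, 'model 5': 0.4699, 'model 6': 0.5123,
    'model 7': 0.5062, 'model 8': 0.4986,
}
summary = pd.Series(test_acc, name='test_accuracy').sort_values(ascending=False)
print(summary.to_string())
```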

      CNN¶

      [363]:
       
from keras.layers import SimpleRNN, Conv1D

      Model 9. (only additional features of tweets)¶


      Create model¶

      [186]:
       
      def create_model_9(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          
          masked_input = Masking(mask_value=0.0)(additional_tweet_input)
          
          cnn_layer1 = Conv1D(filters=64, kernel_size=3, activation='relu')(masked_input)
          # features_cnn_layer = MaxPooling1D()(features_cnn_layer)
          flatten_layer1 = Flatten()(cnn_layer1)
          
          output_layer = Dense(1, activation='sigmoid')(flatten_layer1)
      ​
          model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
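With the default padding='valid', Conv1D with kernel_size=3 shortens the 20-step tweet sequence to 20 - 3 + 1 = 18 steps before Flatten. A numpy sketch of that length arithmetic (single filter, toy data; not Keras code):

```python
import numpy as np

def conv1d_valid(x, kernel):
    """'Valid' 1-D cross-correlation of x (L, C) with one filter (k, C):
    returns L - k + 1 outputs, one per window position."""
    L, k = x.shape[0], kernel.shape[0]
    return np.array([np.sum(x[i:i + k] * kernel) for i in range(L - k + 1)])

x = np.ones((20, 15))        # 20 tweets, 15 features per tweet
w = np.ones((3, 15))         # kernel_size=3
out = conv1d_valid(x, w)     # 18 windows, each summing 3*15 ones
```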

      batch_size=250, epochs=400¶


      Create and train model¶

      [187]:
       
      model_name = 'model_tweets_data_based_10000_9_v1_batch_size_250_20_latest_tweets_of_user_padded_tweet_text_only'
      ​
      model = create_model_9(add_tweets_feat_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      [188]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.503, max:    0.871, cur:    0.843)
      	validation       	 (min:    0.480, max:    0.520, cur:    0.499)
      Loss
      	training         	 (min:   26.574, max: 5978.037, cur:   46.327)
      	validation       	 (min:  623.592, max: 2703.696, cur:  789.925)
      
      Epoch 202: val_accuracy did not improve from 0.52019
      28/28 [==============================] - 1s 32ms/step - loss: 46.3273 - accuracy: 0.8431 - val_loss: 789.9250 - val_accuracy: 0.4987
      

      Prediction and results¶

      [189]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.5095890410958904
      Precision: [0.51100629 0.50849515]
      Recall: 0.5739726027397261
      F1 score: 0.539254
      ROC AUC: 0.509589
      
      <Figure size 640x480 with 0 Axes>

Model 10. (only text of tweets)¶


      Create model¶

      [194]:
       
      def create_model_10(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          
          masked_input = Masking(mask_value=0.0)(additional_tweet_input)
          
          cnn_layer1 = Conv1D(filters=64, kernel_size=3, activation='relu')(masked_input)
          pooling_layer1 = MaxPooling1D(pool_size=2, strides=2, padding='valid')(cnn_layer1)
          flatten_layer1 = Flatten()(pooling_layer1)
          
          dropout_layer1 = Dropout(0.2)(flatten_layer1)
          
    output_layer = Dense(1, activation='sigmoid')(dropout_layer1)
      ​
          model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
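MaxPooling1D(pool_size=2, strides=2, padding='valid') then halves the 18-step convolution output to (18 - 2) // 2 + 1 = 9 steps. A toy numpy sketch of the pooling itself:

```python
import numpy as np

def maxpool1d_valid(x, pool_size=2, strides=2):
    """Max-pool over axis 0 of x (L, C) with 'valid' padding."""
    out_len = (x.shape[0] - pool_size) // strides + 1
    return np.stack([x[i * strides:i * strides + pool_size].max(axis=0)
                     for i in range(out_len)])

x = np.arange(72, dtype=float).reshape(18, 4)   # 18 steps, 4 channels
pooled = maxpool1d_valid(x)                     # -> (9, 4)
```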

      batch_size=250, epochs=400¶


      Create and train model¶

      [195]:
       
      model_name = 'model_tweets_data_based_10000_10_v1_batch_size_250_20_latest_tweets_of_user_padded_tweet_text_only'
      ​
      model = create_model_10(add_tweets_feat_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      [196]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.488, max:    0.518, cur:    0.502)
      	validation       	 (min:    0.483, max:    0.518, cur:    0.499)
      Loss
      	training         	 (min:    0.797, max: 8489.423, cur:    0.964)
      	validation       	 (min:    0.692, max: 1467.734, cur:    0.693)
      
      Epoch 107: val_accuracy did not improve from 0.51817
      28/28 [==============================] - 1s 32ms/step - loss: 0.9644 - accuracy: 0.5022 - val_loss: 0.6929 - val_accuracy: 0.4993
      

      Prediction and results¶

      [197]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.5068493150684932
      Precision: [0.50595238 0.50806452]
      Recall: 0.4315068493150685
      F1 score: 0.466667
      ROC AUC: 0.506849
      
      <Figure size 640x480 with 0 Axes>

      Prediction and results on training set¶

      [201]:
       
      prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)
      Accuracy: 0.506334581053844
      Precision: [0.50554435 0.50738751]
      Recall: 0.43507054419809965
      F1 score: 0.468455
      ROC AUC: 0.506335
      
      <Figure size 640x480 with 0 Axes>

Model 11. (only text of tweets)¶


      Create model¶

      [216]:
       
      def create_model_11(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          
          masked_input = Masking(mask_value=0.0)(additional_tweet_input)
          
          cnn_layer1 = Conv1D(filters=64, kernel_size=3, activation='relu')(masked_input)
          pooling_layer1 = MaxPooling1D(pool_size=4, strides=2, padding='valid')(cnn_layer1)
          flatten_layer1 = Flatten()(pooling_layer1)
          
          dropout_layer1 = Dropout(0.2)(flatten_layer1)
          
          output_layer = Dense(1, activation='sigmoid')(dropout_layer1)
      ​
          model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
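Here pool_size=4 with strides=2 gives overlapping pooling windows; the general 'valid' output-length formula floor((L - pool_size) / strides) + 1 turns the 18-step conv output into 8 steps. A one-line check:

```python
def pooled_len(L, pool_size, strides):
    """Output length of 1-D pooling with padding='valid'."""
    return (L - pool_size) // strides + 1

conv_len = 20 - 3 + 1                  # Conv1D(kernel_size=3) on 20 steps
length = pooled_len(conv_len, 4, 2)    # overlapping windows -> 8
```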

      batch_size=250, epochs=400¶


      Create and train model¶

      [217]:
       
      model_name = 'model_tweets_data_based_10000_11_v1_batch_size_250_20_latest_tweets_of_user_padded_tweet_text_only'
      ​
      model = create_model_11(add_tweets_feat_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      [218]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.508, max:    0.579, cur:    0.533)
      	validation       	 (min:    0.475, max:    0.528, cur:    0.503)
      Loss
      	training         	 (min:    2.630, max: 11262.217, cur:    2.655)
      	validation       	 (min:    4.687, max: 4742.995, cur:    4.936)
      
      Epoch 134: val_accuracy did not improve from 0.52826
      28/28 [==============================] - 1s 33ms/step - loss: 2.6553 - accuracy: 0.5334 - val_loss: 4.9364 - val_accuracy: 0.5034
      

      Prediction and results¶

      [219]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.49794520547945204
      Precision: [0.49677419 0.49849246]
      Recall: 0.6794520547945205
      F1 score: 0.575072
      ROC AUC: 0.497945
      
      <Figure size 640x480 with 0 Axes>

      Prediction and results on training set¶

      [220]:
       
      prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)
      Accuracy: 0.5813417794414051
      Precision: [0.63012437 0.5591623 ]
      Recall: 0.7687877915346962
      F1 score: 0.647430
      ROC AUC: 0.581342
      
      <Figure size 640x480 with 0 Axes>

      Model 12. (only text of tweets)¶


      Create model¶

      [224]:
       
      def create_model_12(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          
          masked_input = Masking(mask_value=0.0)(additional_tweet_input)
          
          cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(masked_input)
          pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
          flatten_layer1 = Flatten()(pooling_layer1)
          
          dropout_layer1 = Dropout(0.2)(flatten_layer1)
          
          output_layer = Dense(1, activation='sigmoid')(dropout_layer1)
      ​
          model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model

      batch_size=250, epochs=400¶


      Create and train model¶

      [225]:
       
      model_name = 'model_tweets_data_based_10000_12_v1_batch_size_250_20_latest_tweets_of_user_padded_tweet_text_only'
      ​
      model = create_model_12(add_tweets_feat_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      [226]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.498, max:    0.533, cur:    0.512)
      	validation       	 (min:    0.483, max:    0.533, cur:    0.508)
      Loss
      	training         	 (min:    1.001, max: 17936.178, cur:    1.036)
      	validation       	 (min:    1.909, max: 7260.793, cur:    1.944)
      
      Epoch 176: val_accuracy did not improve from 0.53297
      28/28 [==============================] - 1s 32ms/step - loss: 1.0360 - accuracy: 0.5117 - val_loss: 1.9438 - val_accuracy: 0.5081
      

      Prediction and results¶

      [227]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.4917808219178082
      Precision: [0.49393939 0.48723404]
      Recall: 0.3136986301369863
      F1 score: 0.381667
      ROC AUC: 0.491781
      
      <Figure size 640x480 with 0 Axes>

      Prediction and results on training set¶

      [228]:
       
      prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)
      Accuracy: 0.5200115174200979
      Precision: [0.5146285  0.53166287]
      Recall: 0.3360207313561762
      F1 score: 0.411785
      ROC AUC: 0.520012
      
      <Figure size 640x480 with 0 Axes>

Model 13. (only text of tweets)¶


      Create model¶

      [238]:
       
      def create_model_13(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          print(additional_tweet_input.shape)
          
          masked_input = Masking(mask_value=0.0)(additional_tweet_input)
          print(masked_input.shape)
          
          cnn_layer1 = Conv1D(filters=15, kernel_size=3, activation='relu')(masked_input)
          print(cnn_layer1.shape)
          # pooling_layer1 = MaxPooling1D(pool_size=4, strides=1, padding='valid')(cnn_layer1)
          flatten_layer1 = Flatten()(cnn_layer1)
          print(flatten_layer1.shape)
          
          dropout_layer1 = Dropout(0.2)(flatten_layer1)
          
          output_layer = Dense(1, activation='sigmoid')(dropout_layer1)
      ​
          model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model

      batch_size=250, epochs=400¶


      Create and train model¶

      [239]:
       
      model_name = 'model_tweets_data_based_10000_13_v1_batch_size_250_20_latest_tweets_of_user_padded_tweet_text_only'
      ​
      model = create_model_13(add_tweets_feat_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      (None, 20, 15)
      (None, 20, 15)
      (None, 18, 15)
      (None, 270)
      
      [240]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.485, max:    0.541, cur:    0.506)
      	validation       	 (min:    0.471, max:    0.535, cur:    0.495)
      Loss
      	training         	 (min:    2.628, max: 12445.366, cur:    2.777)
      	validation       	 (min:    2.351, max: 6456.548, cur:    2.813)
      
      Epoch 139: val_accuracy did not improve from 0.53499
      28/28 [==============================] - 1s 30ms/step - loss: 2.7770 - accuracy: 0.5056 - val_loss: 2.8129 - val_accuracy: 0.4946
      

      Prediction and results¶

      [241]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.5116438356164383
      Precision: [0.51185495 0.51144011]
      Recall: 0.5205479452054794
      F1 score: 0.515954
      ROC AUC: 0.511644
      
      <Figure size 640x480 with 0 Axes>

      Prediction and results on training set¶

      [242]:
       
      prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)
      Accuracy: 0.5192916786639793
      Precision: [0.51918671 0.5193978 ]
      Recall: 0.5165562913907285
      F1 score: 0.517973
      ROC AUC: 0.519292
      
      <Figure size 640x480 with 0 Axes>

      Model 1. (only text of tweets)¶


      Create model¶

      [275]:
       
      def create_model_1(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          
          masked_input = Masking(mask_value=0.0)(embedding_layer)
          
          # flatten = Flatten()(masked_input)
          # print(flatten.shape)
          # Reshape layer to flatten only the last two dimensions
          reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input)
      ​
          # lstm1 = LSTM(64, return_sequences=True)(masked_input)
          lstm2 = LSTM(64)(reshape)
          
          # dropout = Dropout(0.5)(lstm2)
          output_layer = Dense(1, activation='sigmoid')(lstm2)
      ​
          model = keras.Model(inputs=text_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
      #     dense_layer1 = Dense(128)(concatenated)
      #     activation_layer1 = Activation('relu')(dense_layer1)
      #     output_layer = Dense(1, activation='sigmoid')(activation_layer1)
          
      #     model = Model(inputs=[text_input, additional_tweet_input], outputs=output_layer)
          
      #     model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
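The Reshape layer collapses each tweet's block of token embeddings into a single flat vector, so the LSTM steps once per tweet rather than once per token. In numpy terms (the dimensions below are assumptions for illustration: 20 tweets, 15 tokens per tweet, embedding_dim=100):

```python
import numpy as np

num_tweets, max_tokens, embedding_dim = 20, 15, 100
embedded = np.random.rand(num_tweets, max_tokens, embedding_dim)

# Flatten only the last two axes: one 1500-dim vector per tweet.
flat = embedded.reshape(num_tweets, max_tokens * embedding_dim)
```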

batch_size=50, epochs=400¶


      Create and train model¶

      [276]:
       
      model_name = 'model_tweets_data_based_10000_1_v2_batch_size_50_20_latest_tweets_of_user_padded_tweets_text_only'
      ​
      model = create_model_1(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      [277]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=50, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.499, max:    1.000, cur:    0.998)
      	validation       	 (min:    0.491, max:    0.532, cur:    0.515)
      Loss
      	training         	 (min:    0.001, max:    0.703, cur:    0.005)
      	validation       	 (min:    0.693, max:    4.547, cur:    3.618)
      
      Epoch 273: val_accuracy did not improve from 0.53163
      139/139 [==============================] - 6s 40ms/step - loss: 0.0051 - accuracy: 0.9983 - val_loss: 3.6175 - val_accuracy: 0.5155
      

      Prediction and results¶

      [278]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.5082191780821917
      Precision: [0.50826446 0.50817439]
      Recall: 0.510958904109589
      F1 score: 0.509563
      ROC AUC: 0.508219
      
      <Figure size 640x480 with 0 Axes>

batch_size=100, epochs=400¶


      Create and train model¶

      [280]:
       
      model_name = 'model_tweets_data_based_10000_1_v1_batch_size_100_20_latest_tweets_of_user_padded_tweets_text_only'
      model = create_model_1(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [281]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=100, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.506, max:    1.000, cur:    0.999)
      	validation       	 (min:    0.490, max:    0.526, cur:    0.522)
      Loss
      	training         	 (min:    0.001, max:    0.703, cur:    0.002)
      	validation       	 (min:    0.696, max:    4.314, cur:    3.750)
      
      Epoch 108: val_accuracy did not improve from 0.52624
      70/70 [==============================] - 4s 56ms/step - loss: 0.0016 - accuracy: 0.9993 - val_loss: 3.7500 - val_accuracy: 0.5215
      

      Prediction and results¶

      [282]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.49726027397260275
      Precision: [0.49784946 0.49622642]
      Recall: 0.36027397260273974
      F1 score: 0.417460
      ROC AUC: 0.497260
      
      <Figure size 640x480 with 0 Axes>

batch_size=250, epochs=400¶


      Create and train model¶

      [287]:
       
      model_name = 'model_tweets_data_based_10000_1_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
      model = create_model_1(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [288]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.507, max:    1.000, cur:    0.999)
      	validation       	 (min:    0.476, max:    0.521, cur:    0.485)
      Loss
      	training         	 (min:    0.001, max:    0.707, cur:    0.002)
      	validation       	 (min:    0.696, max:    3.933, cur:    3.933)
      
      Epoch 126: val_accuracy did not improve from 0.52086
      28/28 [==============================] - 3s 114ms/step - loss: 0.0019 - accuracy: 0.9993 - val_loss: 3.9332 - val_accuracy: 0.4852
      

      Prediction and results¶

      [289]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.5095890410958904
      Precision: [0.50902062 0.51023392]
      Recall: 0.4780821917808219
      F1 score: 0.493635
      ROC AUC: 0.509589
      
      <Figure size 640x480 with 0 Axes>

Prediction and results on training set¶

      [290]:
       
      prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)
      Accuracy: 0.9985603224877627
      Precision: [0.99827338 0.99884759]
      Recall: 0.9982723869853153
      F1 score: 0.998560
      ROC AUC: 0.998560
      
      <Figure size 640x480 with 0 Axes>

batch_size=500, epochs=400¶


      Create and train model¶

      [291]:
       
      model_name = 'model_tweets_data_based_10000_1_v1_batch_size_500_20_latest_tweets_of_user_padded_tweets_text_only'
      model = create_model_1(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [292]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=500, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.498, max:    1.000, cur:    0.999)
      	validation       	 (min:    0.479, max:    0.517, cur:    0.509)
      Loss
      	training         	 (min:    0.001, max:    0.706, cur:    0.001)
      	validation       	 (min:    0.693, max:    3.862, cur:    3.835)
      
      Epoch 101: val_accuracy did not improve from 0.51750
      14/14 [==============================] - 3s 200ms/step - loss: 0.0014 - accuracy: 0.9994 - val_loss: 3.8347 - val_accuracy: 0.5094
      

      Prediction and results¶

      [293]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.5089041095890411
      Precision: [0.50821745 0.50971599]
      Recall: 0.4671232876712329
      F1 score: 0.487491
      ROC AUC: 0.508904
      
      <Figure size 640x480 with 0 Axes>

      Model 2. (only text of tweets)¶


      Create model¶

      [299]:
       
      def create_model_2(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          
          masked_input = Masking(mask_value=0.0)(embedding_layer)
          
          # flatten = Flatten()(masked_input)
          # print(flatten.shape)
          # Reshape layer to flatten only the last two dimensions
          reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input)
      ​
      ​
          lstm1 = LSTM(128, return_sequences=True)(reshape)
          lstm1_dropout1 = Dropout(0.2)(lstm1)
          lstm2 = LSTM(64)(lstm1_dropout1)
          
          dropout = Dropout(0.2)(lstm2)
          output_layer = Dense(1, activation='sigmoid')(dropout)
      ​
          model = keras.Model(inputs=text_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
      #     dense_layer1 = Dense(128)(concatenated)
      #     activation_layer1 = Activation('relu')(dense_layer1)
      #     output_layer = Dense(1, activation='sigmoid')(activation_layer1)
          
      #     model = Model(inputs=[text_input, additional_tweet_input], outputs=output_layer)
          
      #     model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
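The embedding_matrix passed to these models is built earlier in the notebook from the GloVe files listed in the bucket. A minimal sketch of that standard construction, assuming a Keras-style word_index dict (word to integer id, ids starting at 1) and GloVe's plain-text format:

```python
import numpy as np

def build_embedding_matrix(glove_lines, word_index, embedding_dim):
    """Row i holds the GloVe vector of the word with id i; unknown words
    (and row 0, reserved for padding) stay all-zero."""
    vectors = {}
    for line in glove_lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=float)
    matrix = np.zeros((len(word_index) + 1, embedding_dim))
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix

sample = ["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]   # toy glove-format lines
m = build_embedding_matrix(sample, {"the": 1, "cat": 2, "dog": 3}, 3)
```

Words missing from GloVe (like "dog" here) keep a zero row, which the downstream Masking(mask_value=0.0) can then treat like padding.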

batch_size=250, epochs=400¶


      Create and train model¶

      [300]:
       
      model_name = 'model_tweets_data_based_10000_2_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
      ​
      model = create_model_2(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      ​
      [301]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.504, max:    1.000, cur:    0.998)
      	validation       	 (min:    0.484, max:    0.514, cur:    0.504)
      Loss
      	training         	 (min:    0.001, max:    0.700, cur:    0.008)
      	validation       	 (min:    0.697, max:    4.317, cur:    2.814)
      
      Epoch 116: val_accuracy did not improve from 0.51413
      28/28 [==============================] - 6s 227ms/step - loss: 0.0080 - accuracy: 0.9977 - val_loss: 2.8140 - val_accuracy: 0.5040
      

      Prediction and results¶

      [302]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.5027397260273972
      Precision: [0.503003   0.50251889]
      Recall: 0.5465753424657535
      F1 score: 0.523622
      ROC AUC: 0.502740
      
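prediction_and_metrics is defined earlier in the notebook. As a sanity check on how its printed numbers fit together, here is a hedged numpy re-derivation, assuming a 0.5 threshold on the sigmoid output, per-class precision (hence the two-element array), and recall/F1 for the positive class; the actual helper may differ in detail:

```python
import numpy as np

def binary_metrics(y_true, y_prob, threshold=0.5):
    # Assumed definitions mirroring the printed output; not the notebook's helper.
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    accuracy = (tp + tn) / len(y_true)
    precision = np.array([tn / max(tn + fn, 1),   # precision of class 0
                          tp / max(tp + fp, 1)])  # precision of class 1
    recall = tp / max(tp + fn, 1)                 # recall of the positive class
    f1 = 2 * precision[1] * recall / max(precision[1] + recall, 1e-12)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = binary_metrics([0, 0, 1, 1], [0.2, 0.6, 0.4, 0.9])
print(acc, prec, rec, round(f1, 6))  # 0.5 [0.5 0.5] 0.5 0.5
```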
Model 3. (only text of tweets)

Create model

      [303]:
       
      def create_model_3(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          
          masked_input = Masking(mask_value=0.0)(embedding_layer)
          
          # Reshape layer to flatten only the last two dimensions
          reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input)
          
          lstm1 = LSTM(128, return_sequences=True)(reshape)
          lstm1_dropout1 = Dropout(0.2)(lstm1)
          lstm2 = LSTM(64)(lstm1_dropout1)
          lstm2_dropout = Dropout(0.2)(lstm2)
          
          dense_layer1 = Dense(64)(lstm2_dropout)
          dense_layer1_activation_layer1 = Activation('relu')(dense_layer1)
          dense_layer1_dropout1 = Dropout(0.1)(dense_layer1_activation_layer1)
          output_layer = Dense(1, activation='sigmoid')(dense_layer1_dropout1)
      ​
          model = keras.Model(inputs=text_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
      ​
          return model
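All of these architectures share the same front end: token ids of shape (num_tweets, tokens_per_tweet) are embedded, then Reshape collapses the token and embedding axes so that each tweet becomes a single LSTM timestep. A numpy illustration with the shapes used here, 20 tweets of 15 tokens (embedding_dim=100 is an assumption, matching the GloVe 100d vectors):

```python
import numpy as np

num_tweets, tokens_per_tweet, embedding_dim = 20, 15, 100  # 100 is assumed

# One user's embedded tweets, as produced by the Embedding layer (batch dim omitted).
embedded = np.random.rand(num_tweets, tokens_per_tweet, embedding_dim)

# Reshape((20, 15 * embedding_dim)): the LSTM then runs over 20 tweet-vectors
# of size 1500 instead of 20 x 15 separate token vectors.
per_tweet_vectors = embedded.reshape(num_tweets, tokens_per_tweet * embedding_dim)
print(per_tweet_vectors.shape)  # (20, 1500)
```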
batch_size=250, epochs=400

Create and train model

      [304]:
       
      model_name = 'model_tweets_data_based_10000_3_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
      ​
      model = create_model_3(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [305]:
       
model = train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.514, max:    1.000, cur:    0.999)
      	validation       	 (min:    0.480, max:    0.520, cur:    0.495)
      Loss
      	training         	 (min:    0.001, max:    0.694, cur:    0.001)
      	validation       	 (min:    0.695, max:    4.917, cur:    4.917)
      
      Epoch 121: val_accuracy did not improve from 0.52019
      28/28 [==============================] - 6s 215ms/step - loss: 9.6534e-04 - accuracy: 0.9994 - val_loss: 4.9168 - val_accuracy: 0.4946
      
Prediction and results

      [306]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.5164383561643836
      Precision: [0.51612903 0.51675978]
      Recall: 0.5068493150684932
      F1 score: 0.511757
      ROC AUC: 0.516438
      
Model 4. (only text of tweets)

Create model

      [307]:
       
      def create_model_4(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          
          masked_input = Masking(mask_value=0.0)(embedding_layer)
          
          # Reshape layer to flatten only the last two dimensions
          reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input)
          
          lstm1 = LSTM(64, return_sequences=False)(reshape)
          lstm1_dropout1 = Dropout(0.2)(lstm1)
          # lstm2 = LSTM(64)(lstm1_dropout1)
          # lstm2_dropout = Dropout(0.2)(lstm2)
          
          dense_layer1 = Dense(64)(lstm1_dropout1)
          dense_layer1_activation_layer1 = Activation('relu')(dense_layer1)
          dense_layer1_dropout1 = Dropout(0.1)(dense_layer1_activation_layer1)
          output_layer = Dense(1, activation='sigmoid')(dense_layer1_dropout1)
      ​
          model = keras.Model(inputs=text_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
batch_size=250, epochs=400

Create and train model

      [308]:
       
      model_name = 'model_tweets_data_based_10000_4_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
      ​
      model = create_model_4(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [309]:
       
model = train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.517, max:    1.000, cur:    0.998)
      	validation       	 (min:    0.487, max:    0.526, cur:    0.499)
      Loss
      	training         	 (min:    0.001, max:    0.696, cur:    0.006)
      	validation       	 (min:    0.697, max:    4.915, cur:    3.429)
      
      Epoch 238: val_accuracy did not improve from 0.52557
      28/28 [==============================] - 4s 147ms/step - loss: 0.0059 - accuracy: 0.9983 - val_loss: 3.4287 - val_accuracy: 0.4987
      
Prediction and results

      [310]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.5054794520547945
      Precision: [0.50600601 0.50503778]
      Recall: 0.5493150684931507
      F1 score: 0.526247
      ROC AUC: 0.505479
      
Model 5. (only text of tweets)

Create model

      [312]:
       
      def create_model_5(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          
          masked_input = Masking(mask_value=0.0)(embedding_layer)
          
          # Reshape layer to flatten only the last two dimensions
          reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input)
      ​
          
          lstm1 = LSTM(64, return_sequences=True)(reshape)
          lstm2 = LSTM(64)(lstm1)
          
          output_layer = Dense(1, activation='sigmoid')(lstm2)
      ​
          model = keras.Model(inputs=text_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
batch_size=250, epochs=400

Create and train model

      [313]:
       
      model_name = 'model_tweets_data_based_10000_5_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
      ​
      model = create_model_5(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [314]:
       
model = train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.506, max:    0.999, cur:    0.999)
      	validation       	 (min:    0.483, max:    0.532, cur:    0.503)
      Loss
      	training         	 (min:    0.001, max:    0.696, cur:    0.001)
      	validation       	 (min:    0.695, max:    4.273, cur:    4.272)
      
      Epoch 121: val_accuracy did not improve from 0.53163
      28/28 [==============================] - 5s 167ms/step - loss: 9.8464e-04 - accuracy: 0.9993 - val_loss: 4.2722 - val_accuracy: 0.5027
      
Prediction and results

      [315]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.4931506849315068
      Precision: [0.49396135 0.49208861]
      Recall: 0.426027397260274
      F1 score: 0.456681
      ROC AUC: 0.493151
      
Model 6. (only text of tweets)

Create model

      [178]:
       
      def create_model_6(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          
          masked_input = Masking(mask_value=0.0)(embedding_layer)
          
          # Reshape layer to flatten only the last two dimensions
          reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input)
          
          lstm1 = LSTM(64, return_sequences=True, recurrent_dropout=0.2)(reshape)
          lstm2 = LSTM(64, recurrent_dropout=0.1)(lstm1)
          
          output_layer = Dense(1, activation='sigmoid')(lstm2)
      ​
          model = keras.Model(inputs=text_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
batch_size=250, epochs=400

Create and train model

      [179]:
       
      model_name = 'model_tweets_data_based_10000_6_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
      ​
      model = create_model_6(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [180]:
       
model = train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.504, max:    1.000, cur:    0.999)
      	validation       	 (min:    0.483, max:    0.524, cur:    0.487)
      Loss
      	training         	 (min:    0.001, max:    0.697, cur:    0.001)
      	validation       	 (min:    0.696, max:    4.327, cur:    4.326)
      
      Epoch 163: val_accuracy did not improve from 0.52355
      28/28 [==============================] - 4s 139ms/step - loss: 9.8478e-04 - accuracy: 0.9994 - val_loss: 4.3261 - val_accuracy: 0.4865
      
Prediction and results

      [181]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.49794520547945204
      Precision: [0.49805951 0.49781659]
      Recall: 0.4684931506849315
      F1 score: 0.482710
      ROC AUC: 0.497945
      
Model 7. (only text of tweets)

Create model

      [182]:
       
      def create_model_7(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          
          masked_input = Masking(mask_value=0.0)(embedding_layer)
          
          # Reshape layer to flatten only the last two dimensions
          reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input)
          
          lstm1 = LSTM(64, return_sequences=True, activation="relu")(reshape)
          lstm2 = LSTM(64, activation="relu")(lstm1)
          
          output_layer = Dense(1, activation='sigmoid')(lstm2)
      ​
          model = keras.Model(inputs=text_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
batch_size=250, epochs=400

Create and train model

      [183]:
       
      model_name = 'model_tweets_data_based_10000_7_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
      ​
      model = create_model_7(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [184]:
       
model = train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.498, max:    0.973, cur:    0.923)
      	validation       	 (min:    0.485, max:    0.536, cur:    0.502)
      Loss
      	training         	 (min:    0.053, max:    2.874, cur:    0.217)
      	validation       	 (min:    0.694, max:    7.540, cur:    2.756)
      
      Epoch 193: val_accuracy did not improve from 0.53567
      28/28 [==============================] - 3s 121ms/step - loss: 0.2167 - accuracy: 0.9228 - val_loss: 2.7559 - val_accuracy: 0.5020
      
Prediction and results

      [185]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.5123287671232877
      Precision: [0.51339286 0.51142132]
      Recall: 0.552054794520548
      F1 score: 0.530962
      ROC AUC: 0.512329
      
      [187]:
       
      prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)
      Accuracy: 0.637633170169882
      Precision: [0.64826303 0.62842558]
      Recall: 0.6734811402245897
      F1 score: 0.650174
      ROC AUC: 0.637633
      
Model 8. (only text of tweets)

Create model

      [188]:
       
      def create_model_8(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          
          masked_input = Masking(mask_value=0.0)(embedding_layer)
          
          # Reshape layer to flatten only the last two dimensions
          reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input)
          
          lstm1 = LSTM(16, return_sequences=True)(reshape)
          lstm2 = LSTM(16)(lstm1)
          
          output_layer = Dense(1, activation='sigmoid')(lstm2)
      ​
          model = keras.Model(inputs=text_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
batch_size=250, epochs=400

Create and train model

      [189]:
       
      model_name = 'model_tweets_data_based_10000_8_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
      ​
      model = create_model_8(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [190]:
       
model = train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.504, max:    0.999, cur:    0.999)
      	validation       	 (min:    0.488, max:    0.521, cur:    0.503)
      Loss
      	training         	 (min:    0.002, max:    0.695, cur:    0.002)
      	validation       	 (min:    0.695, max:    3.180, cur:    3.099)
      
      Epoch 109: val_accuracy did not improve from 0.52086
      28/28 [==============================] - 2s 76ms/step - loss: 0.0020 - accuracy: 0.9990 - val_loss: 3.0988 - val_accuracy: 0.5027
      
Prediction and results

      [191]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.510958904109589
      Precision: [0.50952381 0.51290323]
      Recall: 0.43561643835616437
      F1 score: 0.471111
      ROC AUC: 0.510959
      
Model 9. (only text of tweets)

Create model

      [200]:
       
      def create_model_9(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          
          masked_input = Masking(mask_value=0.0)(embedding_layer)
          
          # Reshape layer to flatten only the last two dimensions
          reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input)
          
          lstm1 = LSTM(64, return_sequences=True, dropout=0.8, activation="relu")(reshape)
          lstm2 = LSTM(64, activation="relu")(lstm1)
          
          output_layer = Dense(1, activation='sigmoid')(lstm2)
      ​
          model = keras.Model(inputs=text_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
batch_size=250, epochs=400

Create and train model

      [201]:
       
      model_name = 'model_tweets_data_based_10000_9_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
      ​
      model = create_model_9(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [202]:
       
model = train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.500, max:    0.902, cur:    0.892)
      	validation       	 (min:    0.462, max:    0.517, cur:    0.499)
      Loss
      	training         	 (min:    0.238, max:    0.714, cur:    0.253)
      	validation       	 (min:    0.693, max:    1.640, cur:    1.303)
      
      Epoch 286: val_accuracy did not improve from 0.51750
      28/28 [==============================] - 4s 142ms/step - loss: 0.2527 - accuracy: 0.8920 - val_loss: 1.3031 - val_accuracy: 0.4987
      
Prediction and results

      [205]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.5020547945205479
      Precision: [0.50194553 0.50217707]
      Recall: 0.473972602739726
      F1 score: 0.487667
      ROC AUC: 0.502055
      
Model 10. (only text of tweets)

Create model

      [206]:
       
      def create_model_10(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          
          masked_input = Masking(mask_value=0.0)(embedding_layer)
          
          # Reshape layer to flatten only the last two dimensions
          reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input)
          
          lstm1 = Bidirectional(LSTM(64, return_sequences=True, activation="relu"))(reshape)
          lstm2 = LSTM(64, activation="relu")(lstm1)
          
          output_layer = Dense(1, activation='sigmoid')(lstm2)
      ​
          model = keras.Model(inputs=text_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
batch_size=250, epochs=400

Create and train model

      [207]:
       
      model_name = 'model_tweets_data_based_10000_10_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
      ​
      model = create_model_10(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [208]:
       
model = train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.498, max:    0.980, cur:    0.968)
      	validation       	 (min:    0.472, max:    0.538, cur:    0.485)
      Loss
      	training         	 (min:    0.045, max:   85.676, cur:    0.088)
      	validation       	 (min:    0.694, max:  149.277, cur:    3.631)
      
      Epoch 386: val_accuracy did not improve from 0.53769
      28/28 [==============================] - 5s 171ms/step - loss: 0.0877 - accuracy: 0.9685 - val_loss: 3.6306 - val_accuracy: 0.4845
      
      [440]:
       
      compact_train_tweets_text_data_padded.shape
      [440]:
      (6946, 20, 15)
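So the padded training tensor is 6946 users x 20 tweets x 15 token ids. A minimal sketch of how such a layout can be produced (illustrative only; the notebook builds it earlier with its own tokenizer and padding code):

```python
import numpy as np

def pad_user_tweets(users, num_tweets=20, max_length=15):
    # users: list of users, each a list of tweets, each a list of token ids.
    # Zero-pads/truncates to a fixed (num_tweets, max_length) grid per user.
    out = np.zeros((len(users), num_tweets, max_length), dtype=int)
    for u, tweets in enumerate(users):
        for t, tokens in enumerate(tweets[:num_tweets]):   # keep at most 20 tweets
            out[u, t, :min(len(tokens), max_length)] = tokens[:max_length]
    return out

batch = pad_user_tweets([[[5, 7, 9]], [[1], [2, 3]]])
print(batch.shape)  # (2, 20, 15)
```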
Prediction and results

      [209]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.5
      Precision: [0.5 0.5]
      Recall: 0.6876712328767123
      F1 score: 0.579008
      ROC AUC: 0.500000
      
Model 11. (only text of tweets)

Create model

      [210]:
       
      def create_model_11(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          
          masked_input = Masking(mask_value=0.0)(embedding_layer)
          
          # Reshape layer to flatten only the last two dimensions
          reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input)
          
          lstm1 = LSTM(64, return_sequences=True, activation="relu")(reshape)
          lstm2 = Bidirectional(LSTM(64, activation="relu"))(lstm1)
          
          output_layer = Dense(1, activation='sigmoid')(lstm2)
      ​
          model = keras.Model(inputs=text_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
batch_size=250, epochs=400

Create and train model

      [211]:
       
      model_name = 'model_tweets_data_based_10000_11_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
      ​
      model = create_model_11(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [212]:
       
model = train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.494, max:    0.802, cur:    0.704)
      	validation       	 (min:    0.476, max:    0.527, cur:    0.500)
      Loss
      	training         	 (min:    0.472, max:   39.167, cur:    0.528)
      	validation       	 (min:    0.694, max:   56.739, cur:    1.835)
      
      Epoch 198: val_accuracy did not improve from 0.52692
      28/28 [==============================] - 3s 125ms/step - loss: 0.5281 - accuracy: 0.7039 - val_loss: 1.8355 - val_accuracy: 0.5000
      
Prediction and results

      [213]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.473972602739726
      Precision: [0.47625    0.47121212]
      Recall: 0.426027397260274
      F1 score: 0.447482
      ROC AUC: 0.473973
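`prediction_and_metrics` is defined earlier in the notebook. For reference, the reported quantities can be reproduced from a vector of sigmoid outputs with plain NumPy; a sketch (assuming a 0.5 decision threshold, which the notebook's helper may or may not use):

```python
import numpy as np

def binary_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall and F1 for the positive class,
    computed from predicted probabilities via a hard threshold."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    y_true = np.asarray(y_true).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    accuracy = float(np.mean(y_pred == y_true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

The two-element `Precision` arrays printed above suggest the helper reports per-class precision (e.g. sklearn's `precision_score` with `average=None`) rather than only the positive class.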
      
      CNN¶

      Model 12. (only text of tweets)¶

      Create model¶

      [359]:
       
      def create_model_12(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          
          masked_input = Masking(mask_value=0.0)(embedding_layer)
          
          # flatten = Flatten()(masked_input)
          # print(flatten.shape)
          # Reshape layer to flatten only the last two dimensions
          # reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input)
          
          cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(masked_input)
          # pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
          flatten_layer1 = Flatten()(cnn_layer1)
          
          # dropout_layer1 = Dropout(0.2)(flatten_layer1)
          
          output_layer = Dense(1, activation='sigmoid')(flatten_layer1)
      ​
          model = keras.Model(inputs=text_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
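With `padding='valid'` (the Keras `Conv1D` default), a kernel of size 3 shortens a length-`L` axis to `L - 2`, so the `Flatten` layer above emits `(L - 2) * 16` features per sample. A NumPy sketch of a single-filter 'valid' convolution, shown only to make that shape arithmetic concrete:

```python
import numpy as np

def conv1d_valid(x, kernel):
    """1-D 'valid' cross-correlation: output length is len(x) - len(kernel) + 1."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

x = np.arange(10, dtype=float)       # a length-10 "sequence"
out = conv1d_valid(x, np.ones(3))    # kernel_size=3 -> output length 10 - 3 + 1 = 8
```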
      batch_size=250, epochs=400¶

      Create and train model¶

      [360]:
       
      model_name = 'model_tweets_data_based_10000_12_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
      ​
      model = create_model_12(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [361]:
       
model = train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.505, max:    1.000, cur:    0.999)
      	validation       	 (min:    0.499, max:    0.528, cur:    0.523)
      Loss
      	training         	 (min:    0.005, max:    0.708, cur:    0.006)
      	validation       	 (min:    0.705, max:    5.081, cur:    5.061)
      
      Epoch 232: val_accuracy did not improve from 0.52826
      28/28 [==============================] - 2s 68ms/step - loss: 0.0064 - accuracy: 0.9993 - val_loss: 5.0612 - val_accuracy: 0.5229
      
      Prediction and results¶

      [362]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.47876712328767124
      Precision: [0.4786795  0.47885402]
      Recall: 0.4808219178082192
      F1 score: 0.479836
      ROC AUC: 0.478767
      
      Prediction and results on training set¶

      [363]:
       
      prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)
      Accuracy: 0.9969766772243017
      Precision: [0.99711982 0.99683362]
      Recall: 0.9971206449755254
      F1 score: 0.996977
      ROC AUC: 0.996977
      
      Model 13. (only text of tweets)¶

      Create model¶

      [395]:
       
      def create_model_13(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          
          masked_input = Masking(mask_value=0.0)(embedding_layer)
          
          # flatten = Flatten()(masked_input)
          # print(flatten.shape)
          # Reshape layer to flatten only the last two dimensions
          # reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input)
          
          cnn_layer1 = Conv2D(filters=16, kernel_size=3, activation='relu')(masked_input)
          # pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
          flatten_layer1 = Flatten()(cnn_layer1)
          
          # dropout_layer1 = Dropout(0.2)(flatten_layer1)
          
          output_layer = Dense(1, activation='sigmoid')(flatten_layer1)
      ​
          model = keras.Model(inputs=text_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
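Unlike Model 12, this model applies `Conv2D` to the 4-D embedding output, whose per-sample shape is `(tweets, tokens, embedding_dim)`: the embedding dimension acts as the channel axis and the 3×3 kernel slides across both the tweet and token axes. With `padding='valid'` the per-sample output shape is `(tweets - 2, tokens - 2, 16)`. A small helper to check that arithmetic (illustrative only):

```python
def conv2d_valid_shape(height, width, kernel=(3, 3), filters=16):
    """Output shape of a 'valid' 2-D convolution (stride 1, no padding)."""
    kh, kw = kernel
    return (height - kh + 1, width - kw + 1, filters)

# e.g. 20 tweets x 40 tokens -> (18, 38, 16), flattened to 18 * 38 * 16 units
```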
      batch_size=250, epochs=400¶

      Create and train model¶

      [396]:
       
      model_name = 'model_tweets_data_based_10000_13_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
      ​
      model = create_model_13(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [397]:
       
model = train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.504, max:    1.000, cur:    0.999)
      	validation       	 (min:    0.497, max:    0.524, cur:    0.509)
      Loss
      	training         	 (min:    0.003, max:    0.705, cur:    0.005)
      	validation       	 (min:    0.698, max:    4.305, cur:    4.284)
      
      Epoch 174: val_accuracy did not improve from 0.52355
      28/28 [==============================] - 3s 100ms/step - loss: 0.0047 - accuracy: 0.9990 - val_loss: 4.2841 - val_accuracy: 0.5094
      
      Prediction and results¶

      [398]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.5013698630136987
      Precision: [0.50143266 0.50131234]
      Recall: 0.5232876712328767
      F1 score: 0.512064
      ROC AUC: 0.501370
      
      Prediction and results on training set¶

      [399]:
       
      prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)
      Accuracy: 0.9975525482291967
      Precision: [0.99712313 0.99798271]
      Recall: 0.9971206449755254
      F1 score: 0.997551
      ROC AUC: 0.997553
      
      Model 14. (only text of tweets)¶

      Create model¶

      [428]:
       
      def create_model_14(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          
          masked_input = Masking(mask_value=0.0)(embedding_layer)
          
          # flatten = Flatten()(masked_input)
          # print(flatten.shape)
          # Reshape layer to flatten only the last two dimensions
          # reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input)
          
          cnn_layer1 = Conv2D(filters=16, kernel_size=(3,3), activation='relu')(masked_input)
          # pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
          flatten_layer1 = Flatten()(cnn_layer1)
          
          dropout_layer1 = Dropout(0.2)(flatten_layer1)
          
          output_layer = Dense(1, activation='sigmoid')(dropout_layer1)
      ​
          model = keras.Model(inputs=text_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
      batch_size=250, epochs=400¶

      Create and train model¶

      [429]:
       
      model_name = 'model_tweets_data_based_10000_14_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
      ​
      model = create_model_14(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [430]:
       
model = train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.500, max:    0.993, cur:    0.993)
      	validation       	 (min:    0.478, max:    0.536, cur:    0.510)
      Loss
      	training         	 (min:    0.026, max:    0.715, cur:    0.026)
      	validation       	 (min:    0.700, max:    2.998, cur:    2.873)
      
      Epoch 330: val_accuracy did not improve from 0.53634
      28/28 [==============================] - 3s 112ms/step - loss: 0.0257 - accuracy: 0.9929 - val_loss: 2.8729 - val_accuracy: 0.5101
      
      Prediction and results¶

      [431]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.4876712328767123
      Precision: [0.48728814 0.48803191]
      Recall: 0.5027397260273972
      F1 score: 0.495277
      ROC AUC: 0.487671
      
      Prediction and results on training set¶

      [432]:
       
      prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)
      Accuracy: 0.9971206449755254
      Precision: [0.99683453 0.99740709]
      Recall: 0.996832709473078
      F1 score: 0.997120
      ROC AUC: 0.997121
      
      Model 15. (only text of tweets)¶

      Create model¶

      [433]:
       
      def create_model_15(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          
          masked_input = Masking(mask_value=0.0)(embedding_layer)
          
          # flatten = Flatten()(masked_input)
          # print(flatten.shape)
          # Reshape layer to flatten only the last two dimensions
          # reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input)
          
    cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(masked_input)
    dropout_layer1 = Dropout(0.5)(cnn_layer1)  # note: not connected below; Flatten takes cnn_layer1, so this dropout has no effect
    # pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
    flatten_layer1 = Flatten()(cnn_layer1)
          
          dropout_layer2 = Dropout(0.2)(flatten_layer1)
          
          output_layer = Dense(1, activation='sigmoid')(dropout_layer2)
      ​
          model = keras.Model(inputs=text_input, outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
      batch_size=250, epochs=400¶

      Create and train model¶

      [434]:
       
      model_name = 'model_tweets_data_based_10000_15_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
      ​
      model = create_model_15(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [435]:
       
model = train_model(model, 
                            model_name, 
                            train_X=compact_train_tweets_text_data_padded,
                            train_Y=train_users_data_Y,
                            val_X=compact_val_tweets_text_data_padded,
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.503, max:    0.929, cur:    0.929)
      	validation       	 (min:    0.495, max:    0.517, cur:    0.508)
      Loss
      	training         	 (min:    0.174, max:    0.724, cur:    0.175)
      	validation       	 (min:    0.700, max:    1.543, cur:    1.539)
      
      Epoch 103: val_accuracy did not improve from 0.51750
      28/28 [==============================] - 2s 67ms/step - loss: 0.1748 - accuracy: 0.9289 - val_loss: 1.5394 - val_accuracy: 0.5081
      
      Prediction and results¶

      [436]:
       
      prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
      Accuracy: 0.5260273972602739
      Precision: [0.52661064 0.52546917]
      Recall: 0.536986301369863
      F1 score: 0.531165
      ROC AUC: 0.526027
      
      Prediction and results on training set¶

      [437]:
       
      prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)
      Accuracy: 0.6474229772530953
      Precision: [0.64528944 0.64962011]
      Recall: 0.6400806219406853
      F1 score: 0.644815
      ROC AUC: 0.647423
      
      Tweets text and additional tweets data¶

Model 1. (tweets text and additional tweets features)¶

      Create model¶

      [447]:
       
      def create_model_1(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          masked_input_add_tweets_feat = Masking(mask_value=0.0)(additional_tweet_input)
          
          # lstm_layer_1 = LSTM(64, return_sequences=True)(masked_input_add_tweets_feat)
          lstm_layer_2 = LSTM(64)(masked_input_add_tweets_feat)
          
          # ---------------------------------------------------------------------
          
           # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          masked_input_text = Masking(mask_value=0.0)(embedding_layer)
          reshaped = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input_text)
          
          cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(reshaped)
          # dropout_layer1 = Dropout(0.5)(cnn_layer1)
          # pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
          cnn_flatten_layer1 = Flatten()(cnn_layer1)
          
          # ---------------------------------------------------------------------
          
          # Concatenate text and additional features
          concatenated = concatenate([lstm_layer_2, cnn_flatten_layer1])
          
          # dense_layer1 = Dense(128)(concatenated)
          # activation_layer1 = Activation('relu')(dense_layer1)
          # dropout_layer1 = Dropout(0.2)(flatten_layer1)
          
          output_layer = Dense(1, activation='sigmoid')(concatenated)
      ​
          model = keras.Model(inputs=[text_input, additional_tweet_input], outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
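The `*_padded` arrays fed to these models suggest each user's latest 20 tweets were tokenized and zero-padded to a common `max_length`, so the `Masking(mask_value=0.0)` layers can skip the padding. A minimal NumPy sketch of that kind of padding (hypothetical names; the notebook's actual preprocessing is defined earlier):

```python
import numpy as np

def pad_user_tweets(token_id_lists, num_tweets, max_length):
    """Zero-pad one user's tweets to a fixed (num_tweets, max_length) block.

    Missing tweets become all-zero rows and short tweets get trailing
    zeros, so 0 can serve as the mask value downstream.
    """
    block = np.zeros((num_tweets, max_length), dtype=np.int64)
    for row, ids in enumerate(token_id_lists[:num_tweets]):
        ids = ids[:max_length]
        block[row, :len(ids)] = ids
    return block

padded = pad_user_tweets([[5, 7, 9], [3]], num_tweets=4, max_length=5)
# shape (4, 5); rows 2 and 3 are all zeros
```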
      batch_size=250, epochs=400¶

      Create and train model¶

      [448]:
       
      model_name = 'model_tweets_data_based_10000_1_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_and_additional_tweets_features'
      ​
      model = create_model_1(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [449]:
       
model = train_model(model, 
                            model_name, 
                            train_X=[compact_train_tweets_text_data_padded,compact_train_tweets_add_feat_data_padded],
                            train_Y=train_users_data_Y,
                            val_X=[compact_val_tweets_text_data_padded, compact_val_tweets_add_feat_data_padded],
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.500, max:    1.000, cur:    1.000)
      	validation       	 (min:    0.506, max:    0.540, cur:    0.530)
      Loss
      	training         	 (min:    0.001, max:    0.704, cur:    0.001)
      	validation       	 (min:    0.693, max:    4.179, cur:    4.179)
      
      Epoch 184: val_accuracy did not improve from 0.54038
      28/28 [==============================] - 2s 88ms/step - loss: 6.1684e-04 - accuracy: 0.9997 - val_loss: 4.1787 - val_accuracy: 0.5303
      
Training-set accuracy exceeds 0.9 after 4 epochs¶

      Prediction and results¶

      [450]:
       
      prediction_and_metrics(model, [compact_test_tweets_text_data_padded, compact_test_tweets_add_feat_data_padded], test_users_data_Y)
      Accuracy: 0.4897260273972603
      Precision: [0.48951049 0.48993289]
      Recall: 0.5
      F1 score: 0.494915
      ROC AUC: 0.489726
      
      Prediction and results on training set¶

      [452]:
       
      prediction_and_metrics(model, [compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded], train_users_data_Y)
      Accuracy: 0.9995680967463288
      Precision: [0.99971198 0.99942429]
      Recall: 0.9997120644975526
      F1 score: 0.999568
      ROC AUC: 0.999568
      
Model 2. (tweets text and additional tweets features)¶

      Create model¶

      [453]:
       
      def create_model_2(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          masked_input_add_tweets_feat = Masking(mask_value=0.0)(additional_tweet_input)
          
          # lstm_layer_1 = LSTM(64, return_sequences=True)(masked_input_add_tweets_feat)
          lstm_layer_2 = LSTM(64)(masked_input_add_tweets_feat)
          lstm_dropout_layer1 = Dropout(0.25)(lstm_layer_2)
          
          # ---------------------------------------------------------------------
          
           # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          masked_input_text = Masking(mask_value=0.0)(embedding_layer)
          reshaped = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input_text)
          
    cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(reshaped)
    cnn_dropout_layer1 = Dropout(0.25)(cnn_layer1)  # note: not connected below; Flatten takes cnn_layer1 directly
    # pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
    cnn_flatten_layer1 = Flatten()(cnn_layer1)
          cnn_dropout_layer2 = Dropout(0.25)(cnn_flatten_layer1)
          
          # ---------------------------------------------------------------------
          
          # Concatenate text and additional features
          concatenated = concatenate([lstm_dropout_layer1, cnn_dropout_layer2])
          
          # dense_layer1 = Dense(128)(concatenated)
          # activation_layer1 = Activation('relu')(dense_layer1)
          # dropout_layer1 = Dropout(0.2)(flatten_layer1)
          
          output_layer = Dense(1, activation='sigmoid')(concatenated)
      ​
          model = keras.Model(inputs=[text_input, additional_tweet_input], outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
      batch_size=250, epochs=400¶

      Create and train model¶

      [454]:
       
      model_name = 'model_tweets_data_based_10000_2_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_and_additional_tweets_features'
      ​
      model = create_model_2(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [455]:
       
model = train_model(model, 
                            model_name, 
                            train_X=[compact_train_tweets_text_data_padded,compact_train_tweets_add_feat_data_padded],
                            train_Y=train_users_data_Y,
                            val_X=[compact_val_tweets_text_data_padded, compact_val_tweets_add_feat_data_padded],
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.504, max:    0.998, cur:    0.997)
      	validation       	 (min:    0.476, max:    0.527, cur:    0.512)
      Loss
      	training         	 (min:    0.009, max:    0.700, cur:    0.010)
      	validation       	 (min:    0.695, max:    3.100, cur:    2.997)
      
      Epoch 178: val_accuracy did not improve from 0.52692
      28/28 [==============================] - 3s 95ms/step - loss: 0.0104 - accuracy: 0.9968 - val_loss: 2.9973 - val_accuracy: 0.5121
      
      Prediction and results¶

      [456]:
       
      prediction_and_metrics(model, [compact_test_tweets_text_data_padded, compact_test_tweets_add_feat_data_padded], test_users_data_Y)
      Accuracy: 0.5020547945205479
      Precision: [0.50223547 0.50190114]
      Recall: 0.5424657534246575
      F1 score: 0.521396
      ROC AUC: 0.502055
      
      Prediction and results on training set¶

      [457]:
       
      prediction_and_metrics(model, [compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded], train_users_data_Y)
      Accuracy: 0.9995680967463288
      Precision: [1.         0.99913694]
      Recall: 1.0
      F1 score: 0.999568
      ROC AUC: 0.999568
      
Model 3. (tweets text and additional tweets features)¶

      Create model¶

      [176]:
      def create_model_3(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          masked_input_add_tweets_feat = Masking(mask_value=0.0)(additional_tweet_input)
          
          # lstm_layer_1 = LSTM(64, return_sequences=True)(masked_input_add_tweets_feat)
          lstm_layer_2 = LSTM(64)(masked_input_add_tweets_feat)
          lstm_dropout_layer1 = Dropout(0.5)(lstm_layer_2)
          
          # ---------------------------------------------------------------------
          
           # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          masked_input_text = Masking(mask_value=0.0)(embedding_layer)
          reshaped = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input_text)
          
    cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(reshaped)
    cnn_dropout_layer1 = Dropout(0.5)(cnn_layer1)  # note: not connected below; Flatten takes cnn_layer1 directly
    # pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
    cnn_flatten_layer1 = Flatten()(cnn_layer1)
          cnn_dropout_layer2 = Dropout(0.25)(cnn_flatten_layer1)
          
          # ---------------------------------------------------------------------
          
          # Concatenate text and additional features
          concatenated = concatenate([lstm_dropout_layer1, cnn_dropout_layer2])
          
          # dense_layer1 = Dense(128)(concatenated)
          # activation_layer1 = Activation('relu')(dense_layer1)
          # dropout_layer1 = Dropout(0.2)(flatten_layer1)
          
          output_layer = Dense(1, activation='sigmoid')(concatenated)
      ​
          model = keras.Model(inputs=[text_input, additional_tweet_input], outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
      batch_size=250, epochs=400¶

      Create and train model¶

      [177]:
       
      model_name = 'model_tweets_data_based_10000_3_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_and_additional_tweets_features'
      ​
      model = create_model_3(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [178]:
       
model = train_model(model, 
                            model_name, 
                            train_X=[compact_train_tweets_text_data_padded,compact_train_tweets_add_feat_data_padded],
                            train_Y=train_users_data_Y,
                            val_X=[compact_val_tweets_text_data_padded, compact_val_tweets_add_feat_data_padded],
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.503, max:    0.998, cur:    0.995)
      	validation       	 (min:    0.493, max:    0.525, cur:    0.509)
      Loss
      	training         	 (min:    0.012, max:    0.701, cur:    0.015)
      	validation       	 (min:    0.694, max:    2.495, cur:    2.408)
      
      Epoch 107: val_accuracy did not improve from 0.52490
      28/28 [==============================] - 2s 84ms/step - loss: 0.0153 - accuracy: 0.9954 - val_loss: 2.4077 - val_accuracy: 0.5087
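The "did not improve" message above comes from patience-based early stopping (`patience=100`). A minimal sketch (not the notebook's actual `train_model` helper) of how that stopping rule behaves: training halts once `val_accuracy` has gone `patience` consecutive epochs without a new best.

```python
def early_stopping_epoch(val_accuracies, patience):
    """Return the epoch at which patience-based early stopping would halt."""
    best = float('-inf')
    best_epoch = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best = acc
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch  # no improvement for `patience` epochs: stop here
    return len(val_accuracies)

# Hypothetical history: best val_accuracy reached at epoch 7, none after.
# With patience=100 training stops at epoch 107, mirroring the log above.
history = [0.50] * 6 + [0.52490] + [0.51] * 400
print(early_stopping_epoch(history, patience=100))  # -> 107
```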
      

      Prediction and results¶

      [179]:
       
      prediction_and_metrics(model, [compact_test_tweets_text_data_padded, compact_test_tweets_add_feat_data_padded], test_users_data_Y)
      Accuracy: 0.5061643835616438
      Precision: [0.50704225 0.50548112]
      Recall: 0.5684931506849316
      F1 score: 0.535139
      ROC AUC: 0.506164
      
      <Figure size 640x480 with 0 Axes>
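As a quick consistency check (not part of the notebook's `prediction_and_metrics`), the reported F1 score should be the harmonic mean of the positive-class precision and the recall printed above:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Positive-class precision and recall reported for the test set above.
p_pos = 0.50548112
recall = 0.5684931506849316
print(f1_score(p_pos, recall))  # close to the reported 0.535139
```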

      Prediction and results on training set¶

      [180]:
       
      prediction_and_metrics(model, [compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded], train_users_data_Y)
      Accuracy: 0.9412611575007198
      Precision: [0.93035664 0.95273264]
      Recall: 0.9285919953930319
      F1 score: 0.940507
      ROC AUC: 0.941261
      
      <Figure size 640x480 with 0 Axes>

Model 4. (tweets text and additional tweets features)¶


      Create model¶

      [181]:
      def create_model_4(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          masked_input_add_tweets_feat = Masking(mask_value=0.0)(additional_tweet_input)
          
          # lstm_layer_1 = LSTM(64, return_sequences=True)(masked_input_add_tweets_feat)
          lstm_layer_2 = LSTM(64)(masked_input_add_tweets_feat)
          lstm_dropout_layer1 = Dropout(0.25)(lstm_layer_2)
          
          # ---------------------------------------------------------------------
          
           # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          masked_input_text = Masking(mask_value=0.0)(embedding_layer)
          reshaped = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input_text)
          
          cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(reshaped)
    cnn_dropout_layer1 = Dropout(0.25)(cnn_layer1)
    # pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
    cnn_flatten_layer1 = Flatten()(cnn_dropout_layer1)  # flatten the dropout output, not the raw conv output
          cnn_dropout_layer2 = Dropout(0.25)(cnn_flatten_layer1)
          
          # ---------------------------------------------------------------------
          
          # Concatenate text and additional features
          concatenated = concatenate([lstm_dropout_layer1, cnn_dropout_layer2])
          
          # dense_layer1 = Dense(128)(concatenated)
          # activation_layer1 = Activation('relu')(dense_layer1)
          # dropout_layer1 = Dropout(0.2)(flatten_layer1)
          
          output_layer = Dense(1, activation='sigmoid')(concatenated)
      ​
          model = keras.Model(inputs=[text_input, additional_tweet_input], outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model

      batch_size=250, epochs=400¶


      Create and train model¶

      [182]:
       
      model_name = 'model_tweets_data_based_10000_4_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_and_additional_tweets_features'
      ​
      model = create_model_4(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [183]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=[compact_train_tweets_text_data_padded,compact_train_tweets_add_feat_data_padded],
                            train_Y=train_users_data_Y,
                            val_X=[compact_val_tweets_text_data_padded, compact_val_tweets_add_feat_data_padded],
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.504, max:    0.998, cur:    0.997)
      	validation       	 (min:    0.487, max:    0.536, cur:    0.516)
      Loss
      	training         	 (min:    0.007, max:    0.702, cur:    0.010)
      	validation       	 (min:    0.693, max:    3.126, cur:    3.084)
      
      Epoch 198: val_accuracy did not improve from 0.53567
      28/28 [==============================] - 2s 90ms/step - loss: 0.0104 - accuracy: 0.9967 - val_loss: 3.0836 - val_accuracy: 0.5162
      

      Prediction and results¶

      [184]:
       
      prediction_and_metrics(model, [compact_test_tweets_text_data_padded, compact_test_tweets_add_feat_data_padded], test_users_data_Y)
      Accuracy: 0.4780821917808219
      Precision: [0.4761194  0.47974684]
      Recall: 0.5191780821917809
      F1 score: 0.498684
      ROC AUC: 0.478082
      
      <Figure size 640x480 with 0 Axes>

      Prediction and results on training set¶

      [185]:
       
      prediction_and_metrics(model, [compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded], train_users_data_Y)
      Accuracy: 0.9994241289951051
      Precision: [0.9997119  0.99913669]
      Recall: 0.9997120644975526
      F1 score: 0.999424
      ROC AUC: 0.999424
      
      <Figure size 640x480 with 0 Axes>

Model 5. (tweets text and additional tweets features)¶


      Create model¶

      [189]:
      def create_model_5(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          masked_input_add_tweets_feat = Masking(mask_value=0.0)(additional_tweet_input)
          
          # lstm_layer_1 = LSTM(64, return_sequences=True)(masked_input_add_tweets_feat)
          lstm_layer_2 = LSTM(64)(masked_input_add_tweets_feat)
          lstm_dropout_layer1 = Dropout(0.5)(lstm_layer_2)
          
          # ---------------------------------------------------------------------
          
           # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          masked_input_text = Masking(mask_value=0.0)(embedding_layer)
          reshaped = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input_text)
          
          cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(reshaped)
    cnn_dropout_layer1 = Dropout(0.5)(cnn_layer1)
    # pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
    cnn_flatten_layer1 = Flatten()(cnn_dropout_layer1)  # flatten the dropout output, not the raw conv output
          cnn_dropout_layer2 = Dropout(0.25)(cnn_flatten_layer1)
          
          # ---------------------------------------------------------------------
          
          # Concatenate text and additional features
          concatenated = concatenate([lstm_dropout_layer1, cnn_dropout_layer2])
          
          concatenated_dense_layer1 = Dense(16)(concatenated)
          concatenated_activation_layer1 = Activation('relu')(concatenated_dense_layer1)
          concatenated_dropout_layer1 = Dropout(0.2)(concatenated_activation_layer1)
          
          output_layer = Dense(1, activation='sigmoid')(concatenated_dropout_layer1)
      ​
          model = keras.Model(inputs=[text_input, additional_tweet_input], outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
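The models above pad each user's tweet sequence with all-zero rows and rely on `Masking(mask_value=0.0)` to skip the padding. A timestep is masked only when every feature equals the mask value, which this small NumPy sketch (with hypothetical data) illustrates:

```python
import numpy as np

# Two "users", three timesteps, two features; all-zero rows are padding.
padded = np.array([
    [[0.3, 1.0], [0.7, 2.0], [0.0, 0.0]],  # last timestep is padding
    [[0.1, 4.0], [0.0, 0.0], [0.0, 0.0]],  # last two timesteps are padding
])

# Keras Masking keeps a timestep when ANY feature differs from mask_value.
mask = np.any(padded != 0.0, axis=-1)
print(mask)
# [[ True  True False]
#  [ True False False]]
```

Note that a real feature vector that happens to be exactly all zeros would also be masked, which is one reason standardized features are usually safe to mask this way only after padding.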

      batch_size=250, epochs=400¶


      Create and train model¶

      [190]:
       
      model_name = 'model_tweets_data_based_10000_5_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_and_additional_tweets_features'
      ​
      model = create_model_5(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [191]:
       
      model =  train_model(model, 
                            model_name, 
                            train_X=[compact_train_tweets_text_data_padded,compact_train_tweets_add_feat_data_padded],
                            train_Y=train_users_data_Y,
                            val_X=[compact_val_tweets_text_data_padded, compact_val_tweets_add_feat_data_padded],
                            val_Y=val_users_data_Y,
                            batch_size=250, 
                            epochs=400,
                            patience=100)
      accuracy
      	training         	 (min:    0.496, max:    0.996, cur:    0.995)
      	validation       	 (min:    0.483, max:    0.530, cur:    0.521)
      Loss
      	training         	 (min:    0.012, max:    0.701, cur:    0.017)
      	validation       	 (min:    0.694, max:    3.489, cur:    3.364)
      
      Epoch 158: val_accuracy did not improve from 0.53028
      28/28 [==============================] - 2s 90ms/step - loss: 0.0168 - accuracy: 0.9947 - val_loss: 3.3637 - val_accuracy: 0.5209
      

      Prediction and results¶

      [192]:
       
      prediction_and_metrics(model, [compact_test_tweets_text_data_padded, compact_test_tweets_add_feat_data_padded], test_users_data_Y)
      Accuracy: 0.5020547945205479
      Precision: [0.50202977 0.50208044]
      Recall: 0.4958904109589041
      F1 score: 0.498966
      ROC AUC: 0.502055
      
      <Figure size 640x480 with 0 Axes>

      Prediction and results on training set¶

      [193]:
       
      prediction_and_metrics(model, [compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded], train_users_data_Y)
      Accuracy: 0.9988482579902102
      Precision: [0.99942346 0.99827437]
      Recall: 0.9994241289951051
      F1 score: 0.998849
      ROC AUC: 0.998848
      
      <Figure size 640x480 with 0 Axes>
      [ ]:
       
      ​
Tweets text and additional tweets data and user data¶
      [228]:
      train_users_data = tf.convert_to_tensor(train_users_data, dtype=tf.float32)
      val_users_data = tf.convert_to_tensor(val_users_data, dtype=tf.float32)
      test_users_data = tf.convert_to_tensor(test_users_data, dtype=tf.float32)
Model 1. (tweets text and additional tweets features and user data)¶

Create model¶
      [233]:
      def create_model_1(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, add_tweets_feat_shape, user_data_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          
    # User data input
          user_data_input = Input(shape=user_data_shape) 
      ​
          # ---------------------------------------------------------------------
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          masked_input_add_tweets_feat = Masking(mask_value=0.0)(additional_tweet_input)
          
          # lstm_layer_1 = LSTM(64, return_sequences=True)(masked_input_add_tweets_feat)
          lstm_layer_2 = LSTM(16)(masked_input_add_tweets_feat)
          lstm_dropout_layer1 = Dropout(0.5)(lstm_layer_2)
          
          # ---------------------------------------------------------------------
          
           # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          masked_input_text = Masking(mask_value=0.0)(embedding_layer)
          reshaped = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input_text)
          
          cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(reshaped)
    cnn_dropout_layer1 = Dropout(0.5)(cnn_layer1)
    # pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
    cnn_flatten_layer1 = Flatten()(cnn_dropout_layer1)  # flatten the dropout output, not the raw conv output
          cnn_dropout_layer2 = Dropout(0.5)(cnn_flatten_layer1)
          
          # ---------------------------------------------------------------------
          
          # Concatenate text and additional features
          concatenated = concatenate([lstm_dropout_layer1, cnn_dropout_layer2, user_data_input])
          
          dense_layer1 = Dense(16)(concatenated)
          activation_layer1 = Activation('relu')(dense_layer1)
          dropout_layer1 = Dropout(0.2)(activation_layer1)
          
    output_layer = Dense(1, activation='sigmoid')(dropout_layer1)  # feed the dense/dropout head defined above, not the raw concatenation
      ​
          model = keras.Model(inputs=[text_input, additional_tweet_input, user_data_input], outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model
batch_size=250, epochs=400¶

Create and train model¶
      [234]:
      model_name = 'model_tweets_data_based_10000_1_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_and_additional_tweets_features_and_user_data'
      ​
      model = create_model_1(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             user_data_shape=train_users_data.shape[1],
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [235]:
      # train_users_data = tf.convert_to_tensor(train_users_data, dtype=tf.float32)
      # val_users_data = tf.convert_to_tensor(val_users_data, dtype=tf.float32)
      # test_users_data = tf.convert_to_tensor(test_users_data, dtype=tf.float32)
      [236]:
model = train_model(model, 
                    model_name, 
                    train_X=[compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded, train_users_data],
                    train_Y=train_users_data_Y,
                    val_X=[compact_val_tweets_text_data_padded, compact_val_tweets_add_feat_data_padded, val_users_data],
                    val_Y=val_users_data_Y,
                    batch_size=250, 
                    epochs=400,
                    patience=100)
      accuracy
      	training         	 (min:    0.492, max:    0.551, cur:    0.521)
      	validation       	 (min:    0.498, max:    0.613, cur:    0.536)
      Loss
      	training         	 (min: 12280035016704.000, max: 16165169201676288.000, cur: 51037809410048.000)
      	validation       	 (min: 24513566720.000, max: 11385774691844096.000, cur: 67552852049920.000)
      
      Epoch 175: val_accuracy did not improve from 0.61306
      28/28 [==============================] - 2s 82ms/step - loss: 51037809410048.0000 - accuracy: 0.5209 - val_loss: 67552852049920.0000 - val_accuracy: 0.5363
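Losses in the 1e13 range, as logged above, are characteristic of feeding raw, unscaled features (e.g. follower counts) into the network. A hedged sketch on hypothetical data of the z-score standardization this notebook's title refers to; after it, every feature column has mean 0 and standard deviation 1:

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.uniform(0, 1e6, size=(100, 5))  # hypothetical raw user features

# Standardize column-wise using training-set statistics only.
mean = raw.mean(axis=0)
std = raw.std(axis=0)
standardized = (raw - mean) / std

print(np.allclose(standardized.mean(axis=0), 0.0, atol=1e-9))  # True
print(np.allclose(standardized.std(axis=0), 1.0))              # True
```

In practice the same `mean` and `std` computed on the training split would also be applied to the validation and test splits.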
      
Prediction and results¶
      [237]:
      prediction_and_metrics(model, [compact_test_tweets_text_data_padded, compact_test_tweets_add_feat_data_padded, test_users_data], test_users_data_Y)
      Accuracy: 0.5767123287671233
      Precision: [0.57179487 0.58235294]
      Recall: 0.5424657534246575
      F1 score: 0.561702
      ROC AUC: 0.576712
      
      <Figure size 640x480 with 0 Axes>
Prediction and results on training set¶
      [238]:
      prediction_and_metrics(model, [compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded, train_users_data], train_users_data_Y)
      Accuracy: 0.6012093291102792
      Precision: [0.5987637  0.60377916]
      Recall: 0.5888281025050389
      F1 score: 0.596210
      ROC AUC: 0.601209
      
      <Figure size 640x480 with 0 Axes>
Model 2. (tweets text and additional tweets features and user data)¶

      Create model¶

      [239]:
      def create_model_2(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                         tweets_text_shape, add_tweets_feat_shape, user_data_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
          
          
    # User data input
          user_data_input = Input(shape=user_data_shape) 
      ​
          # ---------------------------------------------------------------------
          
          # Additional tweet's features input
          additional_tweet_input = Input(shape=add_tweets_feat_shape) 
          masked_input_add_tweets_feat = Masking(mask_value=0.0)(additional_tweet_input)
      ​
          lstm_layer_2 = LSTM(16)(masked_input_add_tweets_feat)
          lstm_dropout_layer1 = Dropout(0.5)(lstm_layer_2)
          
          # ---------------------------------------------------------------------
          
           # Tweets text input
          text_input = Input(shape=tweets_text_shape)
          # Embedding layer for text
          embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input)
          masked_input_text = Masking(mask_value=0.0)(embedding_layer)
          reshaped = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input_text)
          
          cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(reshaped)
    cnn_dropout_layer1 = Dropout(0.5)(cnn_layer1)
    # pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
    cnn_flatten_layer1 = Flatten()(cnn_dropout_layer1)  # flatten the dropout output, not the raw conv output
          cnn_dropout_layer2 = Dropout(0.5)(cnn_flatten_layer1)
          
          # ---------------------------------------------------------------------
          
          # Concatenate text and additional features
          concatenated = concatenate([lstm_dropout_layer1, cnn_dropout_layer2, user_data_input])
          
          # dense_layer1 = Dense(16)(concatenated)
          # activation_layer1 = Activation('relu')(dense_layer1)
          dropout_layer1 = Dropout(0.3)(concatenated)
          
    output_layer = Dense(1, activation='sigmoid')(dropout_layer1)  # feed the dropout defined above, not the raw concatenation
      ​
          model = keras.Model(inputs=[text_input, additional_tweet_input, user_data_input], outputs=output_layer)
          model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
      ​
          return model

      batch_size=250, epochs=400¶


      Create and train model¶

      [240]:
      model_name = 'model_tweets_data_based_10000_2_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_and_additional_tweets_features_and_user_data'
      ​
      model = create_model_2(num_words=num_words, embedding_dim=embedding_dim, 
                             embedding_matrix=embedding_matrix, max_sequence_length=max_length, 
                             trainable=False, 
                             tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                             add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                             user_data_shape=train_users_data.shape[1],
                             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
      [241]:
       
      train_users_data = tf.convert_to_tensor(train_users_data, dtype=tf.float32)
      val_users_data = tf.convert_to_tensor(val_users_data, dtype=tf.float32)
      test_users_data = tf.convert_to_tensor(test_users_data, dtype=tf.float32)
      [242]:
       
model = train_model(model, 
                    model_name, 
                    train_X=[compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded, train_users_data],
                    train_Y=train_users_data_Y,
                    val_X=[compact_val_tweets_text_data_padded, compact_val_tweets_add_feat_data_padded, val_users_data],
                    val_Y=val_users_data_Y,
                    batch_size=250, 
                    epochs=400,
                    patience=100)
      accuracy
      	training         	 (min:    0.487, max:    0.550, cur:    0.525)
      	validation       	 (min:    0.497, max:    0.617, cur:    0.539)
      Loss
      	training         	 (min: 12365196165120.000, max: 28899256534302720.000, cur: 36342876602368.000)
      	validation       	 (min: 341581594624.000, max: 24356484657709056.000, cur: 43420508749824.000)
      
      Epoch 283: val_accuracy did not improve from 0.61709
      28/28 [==============================] - 2s 76ms/step - loss: 36342876602368.0000 - accuracy: 0.5255 - val_loss: 43420508749824.0000 - val_accuracy: 0.5390
      

      Prediction and results¶

      [243]:
       
      prediction_and_metrics(model, [compact_test_tweets_text_data_padded, compact_test_tweets_add_feat_data_padded, test_users_data], test_users_data_Y)
      Accuracy: 0.5904109589041096
      Precision: [0.60060976 0.58208955]
      Recall: 0.6410958904109589
      F1 score: 0.610169
      ROC AUC: 0.590411
      
      <Figure size 640x480 with 0 Axes>

      Prediction and results on training set¶

      [244]:
       
      prediction_and_metrics(model, [compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded, train_users_data], train_users_data_Y)
      Accuracy: 0.6020731356176217
      Precision: [0.61535958 0.59153111]
      Recall: 0.659660236107112
      F1 score: 0.623741
      ROC AUC: 0.602073
      
      <Figure size 640x480 with 0 Axes>
      [ ]:
       
      ​